Telecom Churn Case Study(Machine learning -II)

Data Loading

In [1]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings

%matplotlib inline

# setting max number of columns to be displayed
pd.set_option('display.max_columns',250)
pd.set_option('display.max_rows',250)

# Supress Warnings
warnings.filterwarnings('ignore')
In [2]:
# lets import the dataset
telecom = pd.read_csv("telecom_churn_data.csv")
telecom.head()
Out[2]:
mobile_number circle_id loc_og_t2o_mou std_og_t2o_mou loc_ic_t2o_mou last_date_of_month_6 last_date_of_month_7 last_date_of_month_8 last_date_of_month_9 arpu_6 arpu_7 arpu_8 arpu_9 onnet_mou_6 onnet_mou_7 onnet_mou_8 onnet_mou_9 offnet_mou_6 offnet_mou_7 offnet_mou_8 offnet_mou_9 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_ic_mou_9 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 roam_og_mou_9 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2t_mou_9 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2m_mou_9 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2f_mou_9 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_t2c_mou_9 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 loc_og_mou_9 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2t_mou_9 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2m_mou_9 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_t2f_mou_9 std_og_t2c_mou_6 std_og_t2c_mou_7 std_og_t2c_mou_8 std_og_t2c_mou_9 std_og_mou_6 std_og_mou_7 std_og_mou_8 std_og_mou_9 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 isd_og_mou_9 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 spl_og_mou_9 og_others_6 og_others_7 og_others_8 og_others_9 total_og_mou_6 total_og_mou_7 total_og_mou_8 total_og_mou_9 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2t_mou_9 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2m_mou_9 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_t2f_mou_9 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 loc_ic_mou_9 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2t_mou_9 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2m_mou_9 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_t2f_mou_9 std_ic_t2o_mou_6 std_ic_t2o_mou_7 std_ic_t2o_mou_8 std_ic_t2o_mou_9 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 std_ic_mou_9 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 total_ic_mou_9 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 spl_ic_mou_9 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 isd_ic_mou_9 ic_others_6 ic_others_7 ic_others_8 ic_others_9 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_num_9 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 total_rech_amt_9 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 max_rech_amt_9 date_of_last_rech_6 date_of_last_rech_7 date_of_last_rech_8 date_of_last_rech_9 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 last_day_rch_amt_9 date_of_last_rech_data_6 date_of_last_rech_data_7 date_of_last_rech_data_8 date_of_last_rech_data_9 total_rech_data_6 total_rech_data_7 total_rech_data_8 total_rech_data_9 max_rech_data_6 max_rech_data_7 max_rech_data_8 max_rech_data_9 count_rech_2g_6 count_rech_2g_7 count_rech_2g_8 count_rech_2g_9 count_rech_3g_6 count_rech_3g_7 count_rech_3g_8 count_rech_3g_9 av_rech_amt_data_6 av_rech_amt_data_7 av_rech_amt_data_8 av_rech_amt_data_9 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_2g_mb_9 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 vol_3g_mb_9 arpu_3g_6 arpu_3g_7 arpu_3g_8 arpu_3g_9 arpu_2g_6 arpu_2g_7 arpu_2g_8 arpu_2g_9 night_pck_user_6 night_pck_user_7 night_pck_user_8 night_pck_user_9 monthly_2g_6 monthly_2g_7 monthly_2g_8 monthly_2g_9 sachet_2g_6 sachet_2g_7 sachet_2g_8 sachet_2g_9 monthly_3g_6 monthly_3g_7 monthly_3g_8 monthly_3g_9 sachet_3g_6 sachet_3g_7 sachet_3g_8 sachet_3g_9 fb_user_6 fb_user_7 fb_user_8 fb_user_9 aon aug_vbc_3g jul_vbc_3g jun_vbc_3g sep_vbc_3g
0 7000842753 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 197.385 214.816 213.803 21.100 NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN 0.00 0.00 0.00 0.00 NaN NaN 0.16 NaN NaN NaN 4.13 NaN NaN NaN 1.15 NaN NaN NaN 5.44 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0.00 NaN 0.00 0.00 5.44 0.00 NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 4 3 2 6 362 252 252 0 252 252 252 0 6/21/2014 7/16/2014 8/8/2014 9/28/2014 252 252 252 0 6/21/2014 7/16/2014 8/8/2014 NaN 1.0 1.0 1.0 NaN 252.0 252.0 252.0 NaN 0.0 0.0 0.0 NaN 1.0 1.0 1.0 NaN 252.0 252.0 252.0 NaN 30.13 1.32 5.75 0.0 83.57 150.76 109.61 0.00 212.17 212.17 212.17 NaN 212.17 212.17 212.17 NaN 0.0 0.0 0.0 NaN 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1.0 1.0 1.0 NaN 968 30.4 0.0 101.20 3.58
1 7001865778 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 34.047 355.074 268.321 86.285 24.11 78.68 7.68 18.34 15.74 99.84 304.76 53.76 0.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 23.88 74.56 7.68 18.34 11.51 75.94 291.86 53.76 0.00 0.00 0.00 0.00 0.0 2.91 0.00 0.00 35.39 150.51 299.54 72.11 0.23 4.11 0.00 0.00 0.00 0.46 0.13 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.23 4.58 0.13 0.00 0.0 0.0 0.0 0.0 4.68 23.43 12.76 0.00 0.00 0.0 0.0 0.0 40.31 178.53 312.44 72.11 1.61 29.91 29.23 116.09 17.48 65.38 375.58 56.93 0.00 8.93 3.61 0.00 19.09 104.23 408.43 173.03 0.00 0.00 2.35 0.00 5.90 0.00 12.49 15.01 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 5.90 0.00 14.84 15.01 26.83 104.23 423.28 188.04 0.00 0.0 0.0 0.00 1.83 0.00 0.0 0.00 0.00 0.00 0.0 0.00 4 9 11 5 74 384 283 121 44 154 65 50 6/29/2014 7/31/2014 8/28/2014 9/30/2014 44 23 30 0 NaN 7/25/2014 8/10/2014 NaN NaN 1.0 2.0 NaN NaN 154.0 25.0 NaN NaN 1.0 2.0 NaN NaN 0.0 0.0 NaN NaN 154.0 50.0 NaN 0.00 108.07 365.47 0.0 0.00 0.00 0.00 0.00 NaN 0.00 0.00 NaN NaN 28.61 7.60 NaN NaN 0.0 0.0 NaN 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 NaN 1.0 1.0 NaN 1006 0.0 0.0 0.00 0.00
2 7001625959 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 167.690 189.058 210.226 290.714 11.54 55.24 37.26 74.81 143.33 220.59 208.36 118.91 0.0 0.00 0.00 38.49 0.0 0.00 0.00 70.94 7.19 28.74 13.58 14.39 29.34 16.86 38.46 28.16 24.11 21.79 15.61 22.24 0.0 135.54 45.76 0.48 60.66 67.41 67.66 64.81 4.34 26.49 22.58 8.76 41.81 67.41 75.53 9.28 1.48 14.76 22.83 0.0 0.0 0.0 0.0 0.0 47.64 108.68 120.94 18.04 0.0 0.0 0.0 0.0 46.56 236.84 96.84 42.08 0.45 0.0 0.0 0.0 155.33 412.94 285.46 124.94 115.69 71.11 67.46 148.23 14.38 15.44 38.89 38.98 99.48 122.29 49.63 158.19 229.56 208.86 155.99 345.41 72.41 71.29 28.69 49.44 45.18 177.01 167.09 118.18 21.73 58.34 43.23 3.86 0.0 0.0 0.0 0.0 139.33 306.66 239.03 171.49 370.04 519.53 395.03 517.74 0.21 0.0 0.0 0.45 0.00 0.85 0.0 0.01 0.93 3.14 0.0 0.36 5 4 2 7 168 315 116 358 86 200 86 100 6/17/2014 7/24/2014 8/14/2014 9/29/2014 0 200 86 0 NaN NaN NaN 9/17/2014 NaN NaN NaN 1.0 NaN NaN NaN 46.0 NaN NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN NaN 46.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 8.42 NaN NaN NaN 2.84 NaN NaN NaN 0.0 NaN NaN NaN 0.0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 NaN NaN NaN 1.0 1103 0.0 0.0 4.17 0.00
3 7001204172 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 221.338 251.102 508.054 389.500 99.91 54.39 310.98 241.71 123.31 109.01 71.68 113.54 0.0 54.86 44.38 0.00 0.0 28.09 39.04 0.00 73.68 34.81 10.61 15.49 107.43 83.21 22.46 65.46 1.91 0.65 4.91 2.06 0.0 0.00 0.00 0.00 183.03 118.68 37.99 83.03 26.23 14.89 289.58 226.21 2.99 1.73 6.53 9.99 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 29.23 16.63 296.11 236.21 0.0 0.0 0.0 0.0 10.96 0.00 18.09 43.29 0.00 0.0 0.0 0.0 223.23 135.31 352.21 362.54 62.08 19.98 8.04 41.73 113.96 64.51 20.28 52.86 57.43 27.09 19.84 65.59 233.48 111.59 48.18 160.19 43.48 66.44 0.00 129.84 1.33 38.56 4.94 13.98 1.18 0.00 0.00 0.00 0.0 0.0 0.0 0.0 45.99 105.01 4.94 143.83 280.08 216.61 53.13 305.38 0.59 0.0 0.0 0.55 0.00 0.00 0.0 0.00 0.00 0.00 0.0 0.80 10 11 18 14 230 310 601 410 60 50 50 50 6/28/2014 7/31/2014 8/31/2014 9/30/2014 30 50 50 30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN 2491 0.0 0.0 0.00 0.00
4 7000142493 109 0.0 0.0 0.0 6/30/2014 7/31/2014 8/31/2014 9/30/2014 261.636 309.876 238.174 163.426 50.31 149.44 83.89 58.78 76.96 91.88 124.26 45.81 0.0 0.00 0.00 0.00 0.0 0.00 0.00 0.00 50.31 149.44 83.89 58.78 67.64 91.88 124.26 37.89 0.00 0.00 0.00 1.93 0.0 0.00 0.00 0.00 117.96 241.33 208.16 98.61 0.00 0.00 0.00 0.00 9.31 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 9.31 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00 5.98 0.00 0.0 0.0 0.0 127.28 241.33 208.16 104.59 105.68 88.49 233.81 154.56 106.84 109.54 104.13 48.24 1.50 0.00 0.00 0.00 214.03 198.04 337.94 202.81 0.00 0.00 0.86 2.31 1.93 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 1.93 0.25 0.86 2.31 216.44 198.29 338.81 205.31 0.00 0.0 0.0 0.18 0.00 0.00 0.0 0.00 0.48 0.00 0.0 0.00 5 6 3 4 196 350 287 200 56 110 110 50 6/26/2014 7/28/2014 8/9/2014 9/28/2014 50 110 110 50 6/4/2014 NaN NaN NaN 1.0 NaN NaN NaN 56.0 NaN NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN NaN 56.0 NaN NaN NaN 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN 0.00 NaN NaN NaN 0.0 NaN NaN NaN 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0.0 NaN NaN NaN 1526 0.0 0.0 0.00 0.00
In [3]:
# Let's check the dimensions of the dataframe
telecom.shape
Out[3]:
(99999, 226)
In [4]:
telecom.describe(include='all')
Out[4]:
mobile_number circle_id loc_og_t2o_mou std_og_t2o_mou loc_ic_t2o_mou last_date_of_month_6 last_date_of_month_7 last_date_of_month_8 last_date_of_month_9 arpu_6 arpu_7 arpu_8 arpu_9 onnet_mou_6 onnet_mou_7 onnet_mou_8 onnet_mou_9 offnet_mou_6 offnet_mou_7 offnet_mou_8 offnet_mou_9 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_ic_mou_9 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 roam_og_mou_9 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2t_mou_9 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2m_mou_9 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2f_mou_9 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_t2c_mou_9 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 loc_og_mou_9 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2t_mou_9 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2m_mou_9 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_t2f_mou_9 std_og_t2c_mou_6 std_og_t2c_mou_7 std_og_t2c_mou_8 std_og_t2c_mou_9 std_og_mou_6 std_og_mou_7 std_og_mou_8 std_og_mou_9 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 isd_og_mou_9 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 spl_og_mou_9 og_others_6 og_others_7 og_others_8 og_others_9 total_og_mou_6 total_og_mou_7 total_og_mou_8 total_og_mou_9 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2t_mou_9 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2m_mou_9 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_t2f_mou_9 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 loc_ic_mou_9 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2t_mou_9 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2m_mou_9 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_t2f_mou_9 std_ic_t2o_mou_6 std_ic_t2o_mou_7 std_ic_t2o_mou_8 std_ic_t2o_mou_9 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 std_ic_mou_9 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 total_ic_mou_9 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 spl_ic_mou_9 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 isd_ic_mou_9 ic_others_6 ic_others_7 ic_others_8 ic_others_9 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_num_9 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 total_rech_amt_9 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 max_rech_amt_9 date_of_last_rech_6 date_of_last_rech_7 date_of_last_rech_8 date_of_last_rech_9 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 last_day_rch_amt_9 date_of_last_rech_data_6 date_of_last_rech_data_7 date_of_last_rech_data_8 date_of_last_rech_data_9 total_rech_data_6 total_rech_data_7 total_rech_data_8 total_rech_data_9 max_rech_data_6 max_rech_data_7 max_rech_data_8 max_rech_data_9 count_rech_2g_6 count_rech_2g_7 count_rech_2g_8 count_rech_2g_9 count_rech_3g_6 count_rech_3g_7 count_rech_3g_8 count_rech_3g_9 av_rech_amt_data_6 av_rech_amt_data_7 av_rech_amt_data_8 av_rech_amt_data_9 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_2g_mb_9 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 vol_3g_mb_9 arpu_3g_6 arpu_3g_7 arpu_3g_8 arpu_3g_9 arpu_2g_6 arpu_2g_7 arpu_2g_8 arpu_2g_9 night_pck_user_6 night_pck_user_7 night_pck_user_8 night_pck_user_9 monthly_2g_6 monthly_2g_7 monthly_2g_8 monthly_2g_9 sachet_2g_6 sachet_2g_7 sachet_2g_8 sachet_2g_9 monthly_3g_6 monthly_3g_7 monthly_3g_8 monthly_3g_9 sachet_3g_6 sachet_3g_7 sachet_3g_8 sachet_3g_9 fb_user_6 fb_user_7 fb_user_8 fb_user_9 aon aug_vbc_3g jul_vbc_3g jun_vbc_3g sep_vbc_3g
count 9.999900e+04 99999.0 98981.0 98981.0 98981.0 99999 99398 98899 98340 99999.000000 99999.000000 99999.000000 99999.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.0 96140.0 94621.0 92254.0 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 99999.000000 99999.000000 99999.000000 99999.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.0 96140.0 94621.0 92254.0 96062.000000 96140.000000 94621.000000 92254.000000 99999.000000 99999.000000 99999.000000 99999.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 96062.000000 96140.000000 94621.000000 92254.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 98392 98232 96377 95239 99999.000000 99999.000000 99999.000000 99999.000000 25153 25571 26339 25922 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.00000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000 25153.000000 25571.000000 26339.000000 25922.000000 99999.000000 99999.000000 99999.000000 99999.000000 99999.000000
unique NaN NaN NaN NaN NaN 1 1 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 30 31 31 30 NaN NaN NaN NaN 30 31 31 30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN 6/30/2014 7/31/2014 8/31/2014 9/30/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6/30/2014 7/31/2014 8/31/2014 9/29/2014 NaN NaN NaN NaN 6/30/2014 7/31/2014 8/31/2014 9/29/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN 99999 99398 98899 98340 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16960 17288 14706 22623 NaN NaN NaN NaN 1888 1813 1998 2329 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 7.001207e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN 282.987358 278.536648 279.154731 261.645069 132.395875 133.670805 133.018098 130.302327 197.935577 197.045133 196.574803 190.337222 9.950013 7.149898 7.292981 6.343841 13.911337 9.818732 9.971890 8.555519 47.100763 46.473010 45.887806 44.584446 93.342088 91.397131 91.755128 90.463192 3.751013 3.792985 3.677991 3.655123 1.123056 1.368500 1.433821 1.232726 144.201175 141.670476 141.328209 138.709970 79.829870 83.299598 83.282673 82.342919 87.299624 90.804137 89.838390 86.276622 1.129011 1.115010 1.067792 1.042362 0.0 0.0 0.0 0.0 168.261218 175.221436 174.191498 169.664466 0.798277 0.776572 0.791247 0.723892 3.916811 4.978279 5.053769 4.412767 0.454157 0.030235 0.033372 0.047456 305.133424 310.231175 304.119513 289.279198 47.922365 47.990520 47.211362 46.281794 107.475650 107.120493 108.460515 106.155471 12.084305 12.599697 11.751834 12.173105 167.491059 167.719540 167.432575 164.619293 9.575993 10.011904 9.883921 9.432479 20.722240 21.656415 21.183211 19.620913 2.156397 2.216923 2.085004 2.173419 0.0 0.0 0.0 0.0 32.457179 33.887833 33.154735 31.229344 200.130037 202.853055 198.750783 189.214260 0.061557 0.033585 0.040361 0.163137 7.460608 8.334936 8.442001 8.063003 0.854656 1.012960 0.970800 1.017162 7.558806 7.700367 7.212912 6.893019 327.514615 322.962970 324.157122 303.345673 104.637486 104.752398 107.728207 101.943889 NaN NaN NaN NaN 63.156252 59.385804 62.641716 43.901249 NaN NaN NaN NaN 2.463802 2.666419 2.651999 2.441170 126.393392 126.729459 125.717301 124.94144 1.864668 2.044699 2.016288 1.781807 0.599133 0.621720 0.635711 0.659363 192.600982 200.981292 197.526489 192.734315 51.904956 51.229937 50.170154 44.719701 121.396219 128.995847 135.410689 136.056613 89.555057 89.384120 91.173849 100.264116 86.398003 85.914450 86.599478 93.712026 0.025086 0.023034 0.020844 0.015971 0.079641 0.083221 0.081001 0.068781 0.389384 0.439634 0.450075 0.393104 0.075921 0.078581 0.082941 0.086341 0.074781 0.080401 0.084501 0.084581 0.914404 0.908764 0.890808 0.860968 1219.854749 68.170248 66.839062 60.021204 3.299373
std 6.956694e+05 0.0 0.0 0.0 0.0 NaN NaN NaN NaN 328.439770 338.156291 344.474791 341.998630 297.207406 308.794148 308.951589 308.477668 316.851613 325.862803 327.170662 319.396092 72.825411 73.447948 68.402466 57.137537 71.443196 58.455762 64.713221 58.438186 150.856393 155.318705 151.184830 147.995390 162.780544 157.492308 156.537048 158.681454 14.230438 14.264986 13.270996 13.457549 5.448946 7.533445 6.783335 5.619021 251.751489 248.731086 245.914311 245.934517 252.476533 263.631042 265.486090 267.184991 255.617850 269.347911 271.757783 261.407396 7.984970 8.599406 7.905971 8.261770 0.0 0.0 0.0 0.0 389.948499 408.922934 411.633049 405.138658 25.765248 25.603052 25.544471 21.310751 14.936449 20.661570 17.855111 16.328227 4.125911 2.161717 2.323464 3.635466 463.419481 480.031178 478.150031 468.980002 140.258485 145.795055 137.239552 140.130610 171.713903 169.423620 169.723759 165.492803 40.140895 42.977442 39.125379 43.840776 254.124029 256.242707 250.025523 249.845070 54.330607 57.411971 55.073186 53.376273 80.793414 86.521393 83.683565 74.913050 16.495594 16.454061 15.812580 15.978601 0.0 0.0 0.0 0.0 106.283386 113.720168 110.127008 101.982303 291.651671 298.124954 289.321094 284.823024 0.160920 0.155725 0.146147 0.527860 59.722948 65.219829 63.813098 63.505379 11.955164 12.673099 13.284348 12.381172 7.078405 7.070422 7.203753 7.096261 398.019701 408.114237 416.540455 404.588583 120.614894 124.523970 126.902505 125.375109 NaN NaN NaN NaN 97.356649 95.915385 104.431816 90.809712 NaN NaN NaN NaN 2.789128 3.031593 3.074987 2.516339 108.477235 109.765267 109.437851 111.36376 2.570254 2.768332 2.720132 2.214701 1.274428 1.394524 1.422827 1.411513 192.646318 196.791224 191.301305 188.400286 213.356445 212.302217 212.347892 198.653570 544.247227 541.494013 558.775335 577.394194 193.124653 195.893924 188.180936 216.291992 172.767523 176.379871 168.247852 171.384224 0.156391 0.150014 0.142863 0.125366 0.295058 0.304395 0.299568 0.278120 1.497320 1.636230 1.630263 1.347140 0.363371 0.387231 0.384947 0.384978 0.568344 0.628334 0.660234 0.650457 0.279772 0.287950 0.311885 0.345987 954.733842 267.580450 271.201856 253.938223 32.408353
min 7.000000e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN -2258.709000 -2014.045000 -945.808000 -1899.505000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.500000 0.500000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -30.820000 -26.040000 -24.490000 -71.090000 -35.830000 -15.480000 -55.830000 -45.740000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 180.000000 0.000000 0.000000 0.000000 0.000000
25% 7.000606e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN 93.411500 86.980500 84.126000 62.685000 7.380000 6.660000 6.460000 5.330000 34.730000 32.190000 31.630000 27.130000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.660000 1.630000 1.600000 1.360000 9.880000 10.025000 9.810000 8.810000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 17.110000 17.480000 17.110000 15.560000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 44.740000 43.010000 38.580000 25.510000 2.990000 3.230000 3.280000 3.290000 17.290000 18.590000 18.930000 18.560000 0.000000 0.000000 0.000000 0.000000 30.390000 32.460000 32.740000 32.290000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.010000 0.000000 38.530000 41.190000 38.290000 32.370000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 3.000000 3.000000 3.000000 109.000000 100.000000 90.000000 52.000000 30.000000 30.000000 30.000000 28.000000 NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN 1.000000 1.000000 1.000000 1.000000 25.000000 25.000000 25.000000 25.00000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 82.000000 92.000000 87.000000 69.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 467.000000 0.000000 0.000000 0.000000 0.000000
50% 7.001205e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN 197.704000 191.640000 192.080000 176.849000 34.310000 32.330000 32.360000 29.840000 96.310000 91.735000 92.140000 87.290000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 11.910000 11.610000 11.730000 11.260000 41.030000 40.430000 40.360000 39.120000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 65.110000 63.685000 63.730000 61.840000 0.000000 0.000000 0.000000 0.000000 3.950000 3.635000 3.310000 2.500000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 11.640000 11.090000 10.410000 8.410000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 145.140000 141.530000 138.610000 125.460000 15.690000 15.740000 16.030000 15.660000 56.490000 57.080000 58.240000 56.610000 0.880000 0.930000 0.930000 0.960000 92.160000 92.550000 93.830000 91.640000 0.000000 0.000000 0.000000 0.000000 2.030000 2.040000 2.030000 1.740000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 5.890000 5.960000 5.880000 5.380000 114.740000 116.340000 114.660000 105.890000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 6.000000 5.000000 5.000000 230.000000 220.000000 225.000000 200.000000 110.000000 110.000000 98.000000 61.000000 NaN NaN NaN NaN 30.000000 30.000000 30.000000 0.000000 NaN NaN NaN NaN 1.000000 1.000000 1.000000 2.000000 145.000000 145.000000 145.000000 145.00000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 154.000000 154.000000 154.000000 164.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.480000 0.420000 0.880000 2.605000 10.830000 8.810000 9.270000 14.800000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 863.000000 0.000000 0.000000 0.000000 0.000000
75% 7.001812e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN 371.060000 365.344500 369.370500 353.466500 118.740000 115.595000 115.860000 112.130000 231.860000 226.815000 228.260000 220.505000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 40.960000 39.910000 40.110000 39.280000 110.390000 107.560000 109.090000 106.810000 2.080000 2.090000 2.040000 1.940000 0.000000 0.000000 0.000000 0.000000 168.270000 164.382500 166.110000 162.225000 30.807500 31.132500 30.580000 28.230000 53.290000 54.040000 52.490000 48.560000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 144.837500 150.615000 147.940000 142.105000 0.000000 0.000000 0.000000 0.000000 2.430000 3.710000 3.990000 3.230000 0.000000 0.000000 0.000000 0.000000 372.860000 378.570000 369.900000 353.480000 46.840000 45.810000 46.290000 45.180000 132.387500 130.960000 133.930000 130.490000 8.140000 8.282500 8.110000 8.140000 208.075000 205.837500 207.280000 202.737500 4.060000 4.230000 4.080000 3.510000 15.030000 15.740000 15.360000 14.260000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 26.930000 28.310000 27.710000 25.690000 251.670000 250.660000 248.990000 236.320000 0.000000 0.000000 0.000000 0.060000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 10.000000 9.000000 9.000000 437.500000 428.000000 434.500000 415.000000 120.000000 128.000000 144.000000 144.000000 NaN NaN NaN NaN 110.000000 110.000000 130.000000 50.000000 NaN NaN NaN NaN 3.000000 3.000000 3.000000 3.000000 177.000000 177.000000 179.000000 179.00000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 1.000000 1.000000 252.000000 252.000000 252.000000 252.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 122.070000 119.560000 122.070000 140.010000 122.070000 122.070000 122.070000 140.010000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1807.500000 0.000000 0.000000 0.000000 0.000000
max 7.002411e+09 109.0 0.0 0.0 0.0 NaN NaN NaN NaN 27731.088000 35145.834000 33543.624000 38805.617000 7376.710000 8157.780000 10752.560000 10427.460000 8362.360000 9667.130000 14007.340000 10310.760000 13724.380000 15371.040000 13095.360000 8464.030000 3775.110000 2812.040000 5337.040000 4428.460000 6431.330000 7400.660000 10752.560000 10389.240000 4729.740000 4557.140000 4961.330000 4429.880000 1466.030000 1196.430000 928.490000 927.410000 342.860000 916.240000 502.090000 339.840000 10643.380000 7674.780000 11039.910000 11099.260000 7366.580000 8133.660000 8014.430000 9382.580000 8314.760000 9284.740000 13950.040000 10223.430000 628.560000 544.630000 516.910000 808.490000 0.0 0.0 0.0 0.0 8432.990000 10936.730000 13980.060000 11495.310000 5900.660000 5490.280000 5681.540000 4244.530000 1023.210000 2372.510000 1390.880000 1635.710000 800.890000 370.130000 394.930000 787.790000 10674.030000 11365.310000 14043.060000 11517.730000 6626.930000 9324.660000 10696.230000 10598.830000 4693.860000 4455.830000 6274.190000 5463.780000 1872.340000 1983.010000 2433.060000 4318.280000 7454.630000 9669.910000 10830.160000 10796.290000 5459.560000 5800.930000 4309.290000 3819.830000 5647.160000 6141.880000 5645.860000 5689.760000 1351.110000 1136.080000 1394.890000 1431.960000 0.0 0.0 0.0 0.0 5712.110000 6745.760000 5957.140000 5956.660000 7716.140000 9699.010000 10830.380000 10796.590000 19.760000 21.330000 16.860000 62.380000 6789.410000 5289.540000 4127.010000 5057.740000 1362.940000 1495.940000 2327.510000 1005.230000 307.000000 138.000000 196.000000 131.000000 35190.000000 40335.000000 45320.000000 37235.000000 4010.000000 4010.000000 4449.000000 3399.000000 NaN NaN NaN NaN 4010.000000 4010.000000 4449.000000 3399.000000 NaN NaN NaN NaN 61.000000 54.000000 60.000000 84.000000 1555.000000 1555.000000 1555.000000 1555.00000 42.000000 48.000000 44.000000 40.000000 29.000000 35.000000 45.000000 49.000000 7546.000000 4365.000000 4076.000000 4061.000000 10285.900000 7873.550000 11117.610000 8993.950000 45735.400000 28144.120000 30036.060000 39221.270000 6362.280000 4980.900000 3716.900000 13884.310000 6433.760000 4809.360000 3483.170000 3467.170000 1.000000 1.000000 1.000000 1.000000 4.000000 5.000000 5.000000 4.000000 42.000000 48.000000 44.000000 40.000000 14.000000 16.000000 16.000000 11.000000 29.000000 35.000000 41.000000 49.000000 1.000000 1.000000 1.000000 1.000000 4337.000000 12916.220000 9165.600000 11166.210000 2618.570000
In [5]:
telecom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Columns: 226 entries, mobile_number to sep_vbc_3g
dtypes: float64(179), int64(35), object(12)
memory usage: 167.8+ MB

There are 99999 rows and 226 columns in the data. Lot of the columns are numeric type, but we need to inspect which are the categorical columns.

In [6]:
# create column name list by types of columns
id_cols = ['mobile_number', 'circle_id']

date_cols = ['last_date_of_month_6',
             'last_date_of_month_7',
             'last_date_of_month_8',
             'last_date_of_month_9',
             'date_of_last_rech_6',
             'date_of_last_rech_7',
             'date_of_last_rech_8',
             'date_of_last_rech_9',
             'date_of_last_rech_data_6',
             'date_of_last_rech_data_7',
             'date_of_last_rech_data_8',
             'date_of_last_rech_data_9'
            ]

cat_cols =  ['night_pck_user_6',
             'night_pck_user_7',
             'night_pck_user_8',
             'night_pck_user_9',
             'fb_user_6',
             'fb_user_7',
             'fb_user_8',
             'fb_user_9'
            ]

num_cols = [column for column in telecom.columns if column not in id_cols + date_cols + cat_cols]

# print the number of columns in each list
print("#ID cols: %d\n#Date cols:%d\n#Numeric cols:%d\n#Category cols:%d" % (len(id_cols), len(date_cols), len(num_cols), len(cat_cols)))

# check if we have missed any column or not
print(len(id_cols) + len(date_cols) + len(num_cols) + len(cat_cols) == telecom.shape[1])
#ID cols: 2
#Date cols:12
#Numeric cols:204
#Category cols:8
True

Handling missing values

In [7]:
# look at missing value ratio in each column
telecom.isnull().sum()*100/telecom.shape[0]### Impute missing values#### Imputing with zeroes
Out[7]:
mobile_number                0.000000
circle_id                    0.000000
loc_og_t2o_mou               1.018010
std_og_t2o_mou               1.018010
loc_ic_t2o_mou               1.018010
last_date_of_month_6         0.000000
last_date_of_month_7         0.601006
last_date_of_month_8         1.100011
last_date_of_month_9         1.659017
arpu_6                       0.000000
arpu_7                       0.000000
arpu_8                       0.000000
arpu_9                       0.000000
onnet_mou_6                  3.937039
onnet_mou_7                  3.859039
onnet_mou_8                  5.378054
onnet_mou_9                  7.745077
offnet_mou_6                 3.937039
offnet_mou_7                 3.859039
offnet_mou_8                 5.378054
offnet_mou_9                 7.745077
roam_ic_mou_6                3.937039
roam_ic_mou_7                3.859039
roam_ic_mou_8                5.378054
roam_ic_mou_9                7.745077
roam_og_mou_6                3.937039
roam_og_mou_7                3.859039
roam_og_mou_8                5.378054
roam_og_mou_9                7.745077
loc_og_t2t_mou_6             3.937039
loc_og_t2t_mou_7             3.859039
loc_og_t2t_mou_8             5.378054
loc_og_t2t_mou_9             7.745077
loc_og_t2m_mou_6             3.937039
loc_og_t2m_mou_7             3.859039
loc_og_t2m_mou_8             5.378054
loc_og_t2m_mou_9             7.745077
loc_og_t2f_mou_6             3.937039
loc_og_t2f_mou_7             3.859039
loc_og_t2f_mou_8             5.378054
loc_og_t2f_mou_9             7.745077
loc_og_t2c_mou_6             3.937039
loc_og_t2c_mou_7             3.859039
loc_og_t2c_mou_8             5.378054
loc_og_t2c_mou_9             7.745077
loc_og_mou_6                 3.937039
loc_og_mou_7                 3.859039
loc_og_mou_8                 5.378054
loc_og_mou_9                 7.745077
std_og_t2t_mou_6             3.937039
std_og_t2t_mou_7             3.859039
std_og_t2t_mou_8             5.378054
std_og_t2t_mou_9             7.745077
std_og_t2m_mou_6             3.937039
std_og_t2m_mou_7             3.859039
std_og_t2m_mou_8             5.378054
std_og_t2m_mou_9             7.745077
std_og_t2f_mou_6             3.937039
std_og_t2f_mou_7             3.859039
std_og_t2f_mou_8             5.378054
std_og_t2f_mou_9             7.745077
std_og_t2c_mou_6             3.937039
std_og_t2c_mou_7             3.859039
std_og_t2c_mou_8             5.378054
std_og_t2c_mou_9             7.745077
std_og_mou_6                 3.937039
std_og_mou_7                 3.859039
std_og_mou_8                 5.378054
std_og_mou_9                 7.745077
isd_og_mou_6                 3.937039
isd_og_mou_7                 3.859039
isd_og_mou_8                 5.378054
isd_og_mou_9                 7.745077
spl_og_mou_6                 3.937039
spl_og_mou_7                 3.859039
spl_og_mou_8                 5.378054
spl_og_mou_9                 7.745077
og_others_6                  3.937039
og_others_7                  3.859039
og_others_8                  5.378054
og_others_9                  7.745077
total_og_mou_6               0.000000
total_og_mou_7               0.000000
total_og_mou_8               0.000000
total_og_mou_9               0.000000
loc_ic_t2t_mou_6             3.937039
loc_ic_t2t_mou_7             3.859039
loc_ic_t2t_mou_8             5.378054
loc_ic_t2t_mou_9             7.745077
loc_ic_t2m_mou_6             3.937039
loc_ic_t2m_mou_7             3.859039
loc_ic_t2m_mou_8             5.378054
loc_ic_t2m_mou_9             7.745077
loc_ic_t2f_mou_6             3.937039
loc_ic_t2f_mou_7             3.859039
loc_ic_t2f_mou_8             5.378054
loc_ic_t2f_mou_9             7.745077
loc_ic_mou_6                 3.937039
loc_ic_mou_7                 3.859039
loc_ic_mou_8                 5.378054
loc_ic_mou_9                 7.745077
std_ic_t2t_mou_6             3.937039
std_ic_t2t_mou_7             3.859039
std_ic_t2t_mou_8             5.378054
std_ic_t2t_mou_9             7.745077
std_ic_t2m_mou_6             3.937039
std_ic_t2m_mou_7             3.859039
std_ic_t2m_mou_8             5.378054
std_ic_t2m_mou_9             7.745077
std_ic_t2f_mou_6             3.937039
std_ic_t2f_mou_7             3.859039
std_ic_t2f_mou_8             5.378054
std_ic_t2f_mou_9             7.745077
std_ic_t2o_mou_6             3.937039
std_ic_t2o_mou_7             3.859039
std_ic_t2o_mou_8             5.378054
std_ic_t2o_mou_9             7.745077
std_ic_mou_6                 3.937039
std_ic_mou_7                 3.859039
std_ic_mou_8                 5.378054
std_ic_mou_9                 7.745077
total_ic_mou_6               0.000000
total_ic_mou_7               0.000000
total_ic_mou_8               0.000000
total_ic_mou_9               0.000000
spl_ic_mou_6                 3.937039
spl_ic_mou_7                 3.859039
spl_ic_mou_8                 5.378054
spl_ic_mou_9                 7.745077
isd_ic_mou_6                 3.937039
isd_ic_mou_7                 3.859039
isd_ic_mou_8                 5.378054
isd_ic_mou_9                 7.745077
ic_others_6                  3.937039
ic_others_7                  3.859039
ic_others_8                  5.378054
ic_others_9                  7.745077
total_rech_num_6             0.000000
total_rech_num_7             0.000000
total_rech_num_8             0.000000
total_rech_num_9             0.000000
total_rech_amt_6             0.000000
total_rech_amt_7             0.000000
total_rech_amt_8             0.000000
total_rech_amt_9             0.000000
max_rech_amt_6               0.000000
max_rech_amt_7               0.000000
max_rech_amt_8               0.000000
max_rech_amt_9               0.000000
date_of_last_rech_6          1.607016
date_of_last_rech_7          1.767018
date_of_last_rech_8          3.622036
date_of_last_rech_9          4.760048
last_day_rch_amt_6           0.000000
last_day_rch_amt_7           0.000000
last_day_rch_amt_8           0.000000
last_day_rch_amt_9           0.000000
date_of_last_rech_data_6    74.846748
date_of_last_rech_data_7    74.428744
date_of_last_rech_data_8    73.660737
date_of_last_rech_data_9    74.077741
total_rech_data_6           74.846748
total_rech_data_7           74.428744
total_rech_data_8           73.660737
total_rech_data_9           74.077741
max_rech_data_6             74.846748
max_rech_data_7             74.428744
max_rech_data_8             73.660737
max_rech_data_9             74.077741
count_rech_2g_6             74.846748
count_rech_2g_7             74.428744
count_rech_2g_8             73.660737
count_rech_2g_9             74.077741
count_rech_3g_6             74.846748
count_rech_3g_7             74.428744
count_rech_3g_8             73.660737
count_rech_3g_9             74.077741
av_rech_amt_data_6          74.846748
av_rech_amt_data_7          74.428744
av_rech_amt_data_8          73.660737
av_rech_amt_data_9          74.077741
vol_2g_mb_6                  0.000000
vol_2g_mb_7                  0.000000
vol_2g_mb_8                  0.000000
vol_2g_mb_9                  0.000000
vol_3g_mb_6                  0.000000
vol_3g_mb_7                  0.000000
vol_3g_mb_8                  0.000000
vol_3g_mb_9                  0.000000
arpu_3g_6                   74.846748
arpu_3g_7                   74.428744
arpu_3g_8                   73.660737
arpu_3g_9                   74.077741
arpu_2g_6                   74.846748
arpu_2g_7                   74.428744
arpu_2g_8                   73.660737
arpu_2g_9                   74.077741
night_pck_user_6            74.846748
night_pck_user_7            74.428744
night_pck_user_8            73.660737
night_pck_user_9            74.077741
monthly_2g_6                 0.000000
monthly_2g_7                 0.000000
monthly_2g_8                 0.000000
monthly_2g_9                 0.000000
sachet_2g_6                  0.000000
sachet_2g_7                  0.000000
sachet_2g_8                  0.000000
sachet_2g_9                  0.000000
monthly_3g_6                 0.000000
monthly_3g_7                 0.000000
monthly_3g_8                 0.000000
monthly_3g_9                 0.000000
sachet_3g_6                  0.000000
sachet_3g_7                  0.000000
sachet_3g_8                  0.000000
sachet_3g_9                  0.000000
fb_user_6                   74.846748
fb_user_7                   74.428744
fb_user_8                   73.660737
fb_user_9                   74.077741
aon                          0.000000
aug_vbc_3g                   0.000000
jul_vbc_3g                   0.000000
jun_vbc_3g                   0.000000
sep_vbc_3g                   0.000000
dtype: float64

Impute missing values

Imputing with zeroes

In [8]:
# some recharge columns have minimum value of 1 while some don't
recharge_cols = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'total_rech_data_9',
                 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_2g_9',
                 'count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8', 'count_rech_3g_9',
                 'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9',
                 'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9',
                 ]

telecom[recharge_cols].describe(include='all')
Out[8]:
total_rech_data_6 total_rech_data_7 total_rech_data_8 total_rech_data_9 count_rech_2g_6 count_rech_2g_7 count_rech_2g_8 count_rech_2g_9 count_rech_3g_6 count_rech_3g_7 count_rech_3g_8 count_rech_3g_9 max_rech_data_6 max_rech_data_7 max_rech_data_8 max_rech_data_9 av_rech_amt_data_6 av_rech_amt_data_7 av_rech_amt_data_8 av_rech_amt_data_9
count 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.000000 25153.000000 25571.000000 26339.000000 25922.00000 25153.000000 25571.000000 26339.000000 25922.000000
mean 2.463802 2.666419 2.651999 2.441170 1.864668 2.044699 2.016288 1.781807 0.599133 0.621720 0.635711 0.659363 126.393392 126.729459 125.717301 124.94144 192.600982 200.981292 197.526489 192.734315
std 2.789128 3.031593 3.074987 2.516339 2.570254 2.768332 2.720132 2.214701 1.274428 1.394524 1.422827 1.411513 108.477235 109.765267 109.437851 111.36376 192.646318 196.791224 191.301305 188.400286
min 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.00000 1.000000 0.500000 0.500000 1.000000
25% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 25.000000 25.000000 25.000000 25.00000 82.000000 92.000000 87.000000 69.000000
50% 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 145.000000 145.000000 145.000000 145.00000 154.000000 154.000000 154.000000 164.000000
75% 3.000000 3.000000 3.000000 3.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 1.000000 1.000000 177.000000 177.000000 179.000000 179.00000 252.000000 252.000000 252.000000 252.000000
max 61.000000 54.000000 60.000000 84.000000 42.000000 48.000000 44.000000 40.000000 29.000000 35.000000 45.000000 49.000000 1555.000000 1555.000000 1555.000000 1555.00000 7546.000000 4365.000000 4076.000000 4061.000000
In [9]:
# It is also observed that the recharge date and the recharge value are missing together which means the customer didn't recharge
telecom.loc[telecom.total_rech_data_6.isnull() & telecom.date_of_last_rech_data_6.isnull(), ["total_rech_data_6", "date_of_last_rech_data_6"]].head(20)
Out[9]:
total_rech_data_6 date_of_last_rech_data_6
1 NaN NaN
2 NaN NaN
3 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
20 NaN NaN
21 NaN NaN
22 NaN NaN

In the recharge variables where minumum value is 1, we can impute missing values with zeroes since it means customer didn't recharge their numbere that month.

In [10]:
# Create a list of recharge columns where we will impute missing values with zeroes
zero_impute = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'total_rech_data_9',
        'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9',
        'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9'
       ]
In [11]:
# Impute missing values with 0
telecom[zero_impute] = telecom[zero_impute].apply(lambda x: x.fillna(0))
In [12]:
# let's make sure values are imputed correctly
print("Missing value ratio:\n")
print(telecom[zero_impute].isnull().sum()*100/telecom.shape[1])

# summary
print("\n\nSummary statistics\n")
print(telecom[zero_impute].describe(include='all'))
Missing value ratio:

total_rech_data_6     0.0
total_rech_data_7     0.0
total_rech_data_8     0.0
total_rech_data_9     0.0
av_rech_amt_data_6    0.0
av_rech_amt_data_7    0.0
av_rech_amt_data_8    0.0
av_rech_amt_data_9    0.0
max_rech_data_6       0.0
max_rech_data_7       0.0
max_rech_data_8       0.0
max_rech_data_9       0.0
dtype: float64


Summary statistics

       total_rech_data_6  total_rech_data_7  total_rech_data_8  \
count       99999.000000       99999.000000       99999.000000   
mean            0.619726           0.681837           0.698517   
std             1.760541           1.924382           1.963417   
min             0.000000           0.000000           0.000000   
25%             0.000000           0.000000           0.000000   
50%             0.000000           0.000000           0.000000   
75%             1.000000           1.000000           1.000000   
max            61.000000          54.000000          60.000000   

       total_rech_data_9  av_rech_amt_data_6  av_rech_amt_data_7  \
count       99999.000000        99999.000000        99999.000000   
mean            0.632806           48.445409           51.393440   
std             1.669040          127.743863          132.629365   
min             0.000000            0.000000            0.000000   
25%             0.000000            0.000000            0.000000   
50%             0.000000            0.000000            0.000000   
75%             1.000000            8.250000           17.000000   
max            84.000000         7546.000000         4365.000000   

       av_rech_amt_data_8  av_rech_amt_data_9  max_rech_data_6  \
count        99999.000000        99999.000000     99999.000000   
mean            52.027022           49.961089        31.792048   
std            131.182609          127.804280        77.248778   
min              0.000000            0.000000         0.000000   
25%              0.000000            0.000000         0.000000   
50%              0.000000            0.000000         0.000000   
75%             23.000000           17.000000         8.000000   
max           4076.000000         4061.000000      1555.000000   

       max_rech_data_7  max_rech_data_8  max_rech_data_9  
count     99999.000000     99999.000000     99999.000000  
mean         32.406314        33.113011        32.387644  
std          78.342435        78.872739        78.818696  
min           0.000000         0.000000         0.000000  
25%           0.000000         0.000000         0.000000  
50%           0.000000         0.000000         0.000000  
75%          14.000000        17.000000        17.000000  
max        1555.000000      1555.000000      1555.000000  
In [13]:
# Drop id and date columns
print("Shape before dropping: ", telecom.shape)
telecom = telecom.drop(id_cols + date_cols, axis=1)
print("Shape after dropping: ", telecom.shape)
Shape before dropping:  (99999, 226)
Shape after dropping:  (99999, 212)

Replace NaN values in categorical variables

We will replace missing values in the categorical values with '-1' where '-1' will be a new category

In [14]:
# Replace missing values with '-1' in categorical columns
telecom[cat_cols] = telecom[cat_cols].apply(lambda x: x.fillna(-1))
In [15]:
# missing value ratio
print("Missing value ratio:\n")
print(telecom[cat_cols].isnull().sum()*100/telecom.shape[0])
Missing value ratio:

night_pck_user_6    0.0
night_pck_user_7    0.0
night_pck_user_8    0.0
night_pck_user_9    0.0
fb_user_6           0.0
fb_user_7           0.0
fb_user_8           0.0
fb_user_9           0.0
dtype: float64

Drop variables with more than a given threshold of missing values

In [16]:
initial_cols = telecom.shape[1]

MISSING_THRESHOLD = 0.7

include_cols = list(telecom.apply(lambda column: True if column.isnull().sum()/telecom.shape[0] < MISSING_THRESHOLD else False))

drop_missing = pd.DataFrame({'features':telecom.columns , 'include': include_cols})
drop_missing.loc[drop_missing.include == True,:]
Out[16]:
features include
0 loc_og_t2o_mou True
1 std_og_t2o_mou True
2 loc_ic_t2o_mou True
3 arpu_6 True
4 arpu_7 True
5 arpu_8 True
6 arpu_9 True
7 onnet_mou_6 True
8 onnet_mou_7 True
9 onnet_mou_8 True
10 onnet_mou_9 True
11 offnet_mou_6 True
12 offnet_mou_7 True
13 offnet_mou_8 True
14 offnet_mou_9 True
15 roam_ic_mou_6 True
16 roam_ic_mou_7 True
17 roam_ic_mou_8 True
18 roam_ic_mou_9 True
19 roam_og_mou_6 True
20 roam_og_mou_7 True
21 roam_og_mou_8 True
22 roam_og_mou_9 True
23 loc_og_t2t_mou_6 True
24 loc_og_t2t_mou_7 True
25 loc_og_t2t_mou_8 True
26 loc_og_t2t_mou_9 True
27 loc_og_t2m_mou_6 True
28 loc_og_t2m_mou_7 True
29 loc_og_t2m_mou_8 True
30 loc_og_t2m_mou_9 True
31 loc_og_t2f_mou_6 True
32 loc_og_t2f_mou_7 True
33 loc_og_t2f_mou_8 True
34 loc_og_t2f_mou_9 True
35 loc_og_t2c_mou_6 True
36 loc_og_t2c_mou_7 True
37 loc_og_t2c_mou_8 True
38 loc_og_t2c_mou_9 True
39 loc_og_mou_6 True
40 loc_og_mou_7 True
41 loc_og_mou_8 True
42 loc_og_mou_9 True
43 std_og_t2t_mou_6 True
44 std_og_t2t_mou_7 True
45 std_og_t2t_mou_8 True
46 std_og_t2t_mou_9 True
47 std_og_t2m_mou_6 True
48 std_og_t2m_mou_7 True
49 std_og_t2m_mou_8 True
50 std_og_t2m_mou_9 True
51 std_og_t2f_mou_6 True
52 std_og_t2f_mou_7 True
53 std_og_t2f_mou_8 True
54 std_og_t2f_mou_9 True
55 std_og_t2c_mou_6 True
56 std_og_t2c_mou_7 True
57 std_og_t2c_mou_8 True
58 std_og_t2c_mou_9 True
59 std_og_mou_6 True
60 std_og_mou_7 True
61 std_og_mou_8 True
62 std_og_mou_9 True
63 isd_og_mou_6 True
64 isd_og_mou_7 True
65 isd_og_mou_8 True
66 isd_og_mou_9 True
67 spl_og_mou_6 True
68 spl_og_mou_7 True
69 spl_og_mou_8 True
70 spl_og_mou_9 True
71 og_others_6 True
72 og_others_7 True
73 og_others_8 True
74 og_others_9 True
75 total_og_mou_6 True
76 total_og_mou_7 True
77 total_og_mou_8 True
78 total_og_mou_9 True
79 loc_ic_t2t_mou_6 True
80 loc_ic_t2t_mou_7 True
81 loc_ic_t2t_mou_8 True
82 loc_ic_t2t_mou_9 True
83 loc_ic_t2m_mou_6 True
84 loc_ic_t2m_mou_7 True
85 loc_ic_t2m_mou_8 True
86 loc_ic_t2m_mou_9 True
87 loc_ic_t2f_mou_6 True
88 loc_ic_t2f_mou_7 True
89 loc_ic_t2f_mou_8 True
90 loc_ic_t2f_mou_9 True
91 loc_ic_mou_6 True
92 loc_ic_mou_7 True
93 loc_ic_mou_8 True
94 loc_ic_mou_9 True
95 std_ic_t2t_mou_6 True
96 std_ic_t2t_mou_7 True
97 std_ic_t2t_mou_8 True
98 std_ic_t2t_mou_9 True
99 std_ic_t2m_mou_6 True
100 std_ic_t2m_mou_7 True
101 std_ic_t2m_mou_8 True
102 std_ic_t2m_mou_9 True
103 std_ic_t2f_mou_6 True
104 std_ic_t2f_mou_7 True
105 std_ic_t2f_mou_8 True
106 std_ic_t2f_mou_9 True
107 std_ic_t2o_mou_6 True
108 std_ic_t2o_mou_7 True
109 std_ic_t2o_mou_8 True
110 std_ic_t2o_mou_9 True
111 std_ic_mou_6 True
112 std_ic_mou_7 True
113 std_ic_mou_8 True
114 std_ic_mou_9 True
115 total_ic_mou_6 True
116 total_ic_mou_7 True
117 total_ic_mou_8 True
118 total_ic_mou_9 True
119 spl_ic_mou_6 True
120 spl_ic_mou_7 True
121 spl_ic_mou_8 True
122 spl_ic_mou_9 True
123 isd_ic_mou_6 True
124 isd_ic_mou_7 True
125 isd_ic_mou_8 True
126 isd_ic_mou_9 True
127 ic_others_6 True
128 ic_others_7 True
129 ic_others_8 True
130 ic_others_9 True
131 total_rech_num_6 True
132 total_rech_num_7 True
133 total_rech_num_8 True
134 total_rech_num_9 True
135 total_rech_amt_6 True
136 total_rech_amt_7 True
137 total_rech_amt_8 True
138 total_rech_amt_9 True
139 max_rech_amt_6 True
140 max_rech_amt_7 True
141 max_rech_amt_8 True
142 max_rech_amt_9 True
143 last_day_rch_amt_6 True
144 last_day_rch_amt_7 True
145 last_day_rch_amt_8 True
146 last_day_rch_amt_9 True
147 total_rech_data_6 True
148 total_rech_data_7 True
149 total_rech_data_8 True
150 total_rech_data_9 True
151 max_rech_data_6 True
152 max_rech_data_7 True
153 max_rech_data_8 True
154 max_rech_data_9 True
163 av_rech_amt_data_6 True
164 av_rech_amt_data_7 True
165 av_rech_amt_data_8 True
166 av_rech_amt_data_9 True
167 vol_2g_mb_6 True
168 vol_2g_mb_7 True
169 vol_2g_mb_8 True
170 vol_2g_mb_9 True
171 vol_3g_mb_6 True
172 vol_3g_mb_7 True
173 vol_3g_mb_8 True
174 vol_3g_mb_9 True
183 night_pck_user_6 True
184 night_pck_user_7 True
185 night_pck_user_8 True
186 night_pck_user_9 True
187 monthly_2g_6 True
188 monthly_2g_7 True
189 monthly_2g_8 True
190 monthly_2g_9 True
191 sachet_2g_6 True
192 sachet_2g_7 True
193 sachet_2g_8 True
194 sachet_2g_9 True
195 monthly_3g_6 True
196 monthly_3g_7 True
197 monthly_3g_8 True
198 monthly_3g_9 True
199 sachet_3g_6 True
200 sachet_3g_7 True
201 sachet_3g_8 True
202 sachet_3g_9 True
203 fb_user_6 True
204 fb_user_7 True
205 fb_user_8 True
206 fb_user_9 True
207 aon True
208 aug_vbc_3g True
209 jul_vbc_3g True
210 jun_vbc_3g True
211 sep_vbc_3g True
In [17]:
# Drop columns
telecom = telecom.loc[:, include_cols]

dropped_cols = telecom.shape[1] - initial_cols
print("{0} columns dropped.".format(dropped_cols))
-16 columns dropped.
In [18]:
# Imputing the respective null columns as 0. 
telecom[telecom.select_dtypes(exclude=['datetime64[ns]','category']).columns.tolist()] = telecom[telecom.select_dtypes(exclude=['datetime64[ns]','category']).columns.tolist()].fillna(0, axis=1)

Filter high-value customers

Calculate total data recharge amount

In [19]:
# Calculate the total data recharge amount for June and July --> number of recharges * average recharge amount
telecom['total_data_rech_6'] = telecom.total_rech_data_6 * telecom.av_rech_amt_data_6
telecom['total_data_rech_7'] = telecom.total_rech_data_7 * telecom.av_rech_amt_data_7

Add total data recharge and total recharge to get total combined recharge amount for a month

In [20]:
# Calculate total recharge amount for June and July --> call recharge amount + data recharge amount
telecom['amt_data_6'] = telecom.total_rech_amt_6 + telecom.total_data_rech_6
telecom['amt_data_7'] = telecom.total_rech_amt_7 + telecom.total_data_rech_7
In [21]:
# Average recharge done by customer in June and July
telecom['av_amt_data_6_7'] = (telecom.amt_data_6 + telecom.amt_data_7)/2
In [22]:
# look at the 70th percentile recharge amount
print("Recharge amount at 70th percentile: {0}".format(telecom.av_amt_data_6_7.quantile(0.7)))
Recharge amount at 70th percentile: 478.0
In [23]:
# Retain only those customers who have recharged their mobiles with more than or equal to 70th percentile amount
telecom_filtered = telecom.loc[telecom.av_amt_data_6_7 >= telecom.av_amt_data_6_7.quantile(0.7), :]
telecom_filtered = telecom_filtered.reset_index(drop=True)
telecom_filtered.shape
Out[23]:
(30001, 201)
In [24]:
# Delete variables created to filter high-value customers
telecom_filtered = telecom_filtered.drop(['total_data_rech_6', 'total_data_rech_7',
                                      'amt_data_6', 'amt_data_7', 'av_amt_data_6_7'], axis=1)
telecom_filtered.shape
Out[24]:
(30001, 196)
In [25]:
telecom_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Columns: 196 entries, loc_og_t2o_mou to sep_vbc_3g
dtypes: float64(163), int64(33)
memory usage: 44.9 MB
In [26]:
# Churn Derivation
In [27]:
# calculate total incoming and outgoing minutes of usage
telecom_filtered['total_calls_mou_9'] = telecom_filtered.total_ic_mou_9 + telecom_filtered.total_og_mou_9
In [28]:
# calculate 2g and 3g data consumption
telecom_filtered['total_internet_mb_9'] =  telecom_filtered.vol_2g_mb_9 + telecom_filtered.vol_3g_mb_9
In [29]:
# create churn variable: those who have not used either calls or internet in the month of September are customers who have churned

# 0 - not churn, 1 - churn
telecom_filtered['churn'] = telecom_filtered.apply(lambda row: 1 if (row.total_calls_mou_9 == 0 and row.total_internet_mb_9 == 0) else 0, axis=1)
In [30]:
# verify that the churn is tagged correctly
In [31]:
telecom_filtered[['churn','total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']]
Out[31]:
churn total_ic_mou_9 total_og_mou_9 vol_2g_mb_9 vol_3g_mb_9
0 1 0.00 0.00 0.00 0.00
1 1 0.00 0.00 0.00 0.00
2 0 34.91 142.18 0.00 0.00
3 0 252.61 1031.53 0.00 0.00
4 0 127.48 127.63 13.93 314.74
... ... ... ... ... ...
29996 0 391.63 64.21 0.00 0.00
29997 0 149.26 14.44 723.41 232.56
29998 0 547.11 1095.24 9.14 941.48
29999 0 265.84 182.16 0.05 393.34
30000 0 2.21 0.00 85.67 4177.14

30001 rows × 5 columns

In [32]:
#After defining the y variable by tagging churners delete the columns having _9 attribute in their name
#After tagging churners, remove all the attributes corresponding to the churn phase
#(all attributes having ‘ _9’, etc. in their names).
#Get the list of columns which has name ends with _9 attribute
In [33]:
# find the columns which has name ends with _9 attribute
month_9_cols = telecom_filtered.columns[telecom_filtered.columns.str.endswith(pat = '_9')]
print("The columns names ends with _9 are given below...")
month_9_cols
The columns names ends with _9 are given below...
Out[33]:
Index(['arpu_9', 'onnet_mou_9', 'offnet_mou_9', 'roam_ic_mou_9',
       'roam_og_mou_9', 'loc_og_t2t_mou_9', 'loc_og_t2m_mou_9',
       'loc_og_t2f_mou_9', 'loc_og_t2c_mou_9', 'loc_og_mou_9',
       'std_og_t2t_mou_9', 'std_og_t2m_mou_9', 'std_og_t2f_mou_9',
       'std_og_t2c_mou_9', 'std_og_mou_9', 'isd_og_mou_9', 'spl_og_mou_9',
       'og_others_9', 'total_og_mou_9', 'loc_ic_t2t_mou_9', 'loc_ic_t2m_mou_9',
       'loc_ic_t2f_mou_9', 'loc_ic_mou_9', 'std_ic_t2t_mou_9',
       'std_ic_t2m_mou_9', 'std_ic_t2f_mou_9', 'std_ic_t2o_mou_9',
       'std_ic_mou_9', 'total_ic_mou_9', 'spl_ic_mou_9', 'isd_ic_mou_9',
       'ic_others_9', 'total_rech_num_9', 'total_rech_amt_9', 'max_rech_amt_9',
       'last_day_rch_amt_9', 'total_rech_data_9', 'max_rech_data_9',
       'av_rech_amt_data_9', 'vol_2g_mb_9', 'vol_3g_mb_9', 'night_pck_user_9',
       'monthly_2g_9', 'sachet_2g_9', 'monthly_3g_9', 'sachet_3g_9',
       'fb_user_9', 'total_calls_mou_9', 'total_internet_mb_9'],
      dtype='object')
In [34]:
# so, deleting derived variables
telecom_filtered = telecom_filtered.drop(['total_calls_mou_9', 'total_internet_mb_9'], axis=1)
In [35]:
# change data type to category
telecom_filtered.churn = telecom_filtered.churn.astype("category")

# print churn ratio
print("Churn Ratio:")
print(telecom_filtered.churn.value_counts()*100/telecom_filtered.shape[0])
Churn Ratio:
0    91.863605
1     8.136395
Name: churn, dtype: float64
In [36]:
# We see that approx. 8.1% of high value customers have churned
In [37]:
#calculate difference variable as the difference between 8th month and the average of 6th and 7th month.
In [38]:
telecom_filtered['arpu_diff'] = telecom_filtered.arpu_8 - ((telecom_filtered.arpu_6 + telecom_filtered.arpu_7)/2)

telecom_filtered['onnet_mou_diff'] = telecom_filtered.onnet_mou_8 - ((telecom_filtered.onnet_mou_6 + telecom_filtered.onnet_mou_7)/2)

telecom_filtered['offnet_mou_diff'] = telecom_filtered.offnet_mou_8 - ((telecom_filtered.offnet_mou_6 + telecom_filtered.offnet_mou_7)/2)

telecom_filtered['roam_ic_mou_diff'] = telecom_filtered.roam_ic_mou_8 - ((telecom_filtered.roam_ic_mou_6 + telecom_filtered.roam_ic_mou_7)/2)

telecom_filtered['roam_og_mou_diff'] = telecom_filtered.roam_og_mou_8 - ((telecom_filtered.roam_og_mou_6 + telecom_filtered.roam_og_mou_7)/2)

telecom_filtered['loc_og_mou_diff'] = telecom_filtered.loc_og_mou_8 - ((telecom_filtered.loc_og_mou_6 + telecom_filtered.loc_og_mou_7)/2)

telecom_filtered['std_og_mou_diff'] = telecom_filtered.std_og_mou_8 - ((telecom_filtered.std_og_mou_6 + telecom_filtered.std_og_mou_7)/2)

telecom_filtered['isd_og_mou_diff'] = telecom_filtered.isd_og_mou_8 - ((telecom_filtered.isd_og_mou_6 + telecom_filtered.isd_og_mou_7)/2)

telecom_filtered['spl_og_mou_diff'] = telecom_filtered.spl_og_mou_8 - ((telecom_filtered.spl_og_mou_6 + telecom_filtered.spl_og_mou_7)/2)

telecom_filtered['total_og_mou_diff'] = telecom_filtered.total_og_mou_8 - ((telecom_filtered.total_og_mou_6 + telecom_filtered.total_og_mou_7)/2)

telecom_filtered['loc_ic_mou_diff'] = telecom_filtered.loc_ic_mou_8 - ((telecom_filtered.loc_ic_mou_6 + telecom_filtered.loc_ic_mou_7)/2)

telecom_filtered['std_ic_mou_diff'] = telecom_filtered.std_ic_mou_8 - ((telecom_filtered.std_ic_mou_6 + telecom_filtered.std_ic_mou_7)/2)

telecom_filtered['isd_ic_mou_diff'] = telecom_filtered.isd_ic_mou_8 - ((telecom_filtered.isd_ic_mou_6 + telecom_filtered.isd_ic_mou_7)/2)

telecom_filtered['spl_ic_mou_diff'] = telecom_filtered.spl_ic_mou_8 - ((telecom_filtered.spl_ic_mou_6 + telecom_filtered.spl_ic_mou_7)/2)

telecom_filtered['total_ic_mou_diff'] = telecom_filtered.total_ic_mou_8 - ((telecom_filtered.total_ic_mou_6 + telecom_filtered.total_ic_mou_7)/2)

telecom_filtered['total_rech_num_diff'] = telecom_filtered.total_rech_num_8 - ((telecom_filtered.total_rech_num_6 + telecom_filtered.total_rech_num_7)/2)

telecom_filtered['total_rech_amt_diff'] = telecom_filtered.total_rech_amt_8 - ((telecom_filtered.total_rech_amt_6 + telecom_filtered.total_rech_amt_7)/2)

telecom_filtered['max_rech_amt_diff'] = telecom_filtered.max_rech_amt_8 - ((telecom_filtered.max_rech_amt_6 + telecom_filtered.max_rech_amt_7)/2)

telecom_filtered['total_rech_data_diff'] = telecom_filtered.total_rech_data_8 - ((telecom_filtered.total_rech_data_6 + telecom_filtered.total_rech_data_7)/2)

telecom_filtered['max_rech_data_diff'] = telecom_filtered.max_rech_data_8 - ((telecom_filtered.max_rech_data_6 + telecom_filtered.max_rech_data_7)/2)

telecom_filtered['av_rech_amt_data_diff'] = telecom_filtered.av_rech_amt_data_8 - ((telecom_filtered.av_rech_amt_data_6 + telecom_filtered.av_rech_amt_data_7)/2)

telecom_filtered['vol_2g_mb_diff'] = telecom_filtered.vol_2g_mb_8 - ((telecom_filtered.vol_2g_mb_6 + telecom_filtered.vol_2g_mb_7)/2)

telecom_filtered['vol_3g_mb_diff'] = telecom_filtered.vol_3g_mb_8 - ((telecom_filtered.vol_3g_mb_6 + telecom_filtered.vol_3g_mb_7)/2)
In [39]:
telecom_filtered['total_og_mou_diff'].describe()
Out[39]:
count    30001.000000
mean       -67.437337
std        502.630069
min      -7213.410000
25%       -168.025000
50%        -14.625000
75%         67.915000
max      12768.705000
Name: total_og_mou_diff, dtype: float64
In [40]:
# delete all variables relating to 9th month
telecom_filtered = telecom_filtered.filter(regex='[^9]$', axis=1)
telecom_filtered.shape
Out[40]:
(30001, 173)
In [41]:
# extract all names that end with 9
col_9_names = telecom.filter(regex='9$', axis=1).columns

# update num_cols and cat_cols column name list
cat_cols = [col for col in cat_cols if col not in col_9_names]
cat_cols.append('churn')
num_cols = [col for col in telecom_filtered.columns if col not in cat_cols]
In [42]:
# change column types
telecom_filtered[num_cols] = telecom_filtered[num_cols].apply(pd.to_numeric)
telecom_filtered[cat_cols] = telecom_filtered[cat_cols].apply(lambda column: column.astype("category"), axis=0)

EDA

Data Analysis(Univariate & Bivariate)
In [43]:
plt.figure(figsize=(20,20))
sns.heatmap(telecom_filtered.corr(), annot=True)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x3701f0>

Observations

As we can see the above plot is not very easy to understand let's check the numerical value of correlation between variables and find out if there are features with high correlation,then we can use PCA

In [44]:
# lets check the correlation amongst the features
cor = telecom_filtered.corr()
cor.loc[:,:] = np.tril(cor, k=-1)
cor = cor.stack()
cor.sort_values(ascending=False)
Out[44]:
total_rech_amt_8  arpu_8           0.955351
total_rech_amt_6  arpu_6           0.946125
total_rech_amt_7  arpu_7           0.943588
isd_og_mou_8      isd_og_mou_7     0.943120
                  isd_og_mou_6     0.918187
                                     ...   
roam_ic_mou_diff  roam_ic_mou_6   -0.453394
roam_og_mou_diff  roam_og_mou_6   -0.458500
isd_ic_mou_diff   isd_ic_mou_6    -0.519470
spl_ic_mou_diff   spl_ic_mou_6    -0.599969
                  spl_ic_mou_7    -0.668119
Length: 26107, dtype: float64

Oberservations

There is very high correlation between many variables, so there is a scope of dimensionality reduction, we would take a look at it in the next segment.

Let us create some utility methods to ease the EDA process.

In [45]:
# Method for numerical data to plot distribution plots (Univariate Analysis)
def distributionPlot(feature):
    fig = plt.figure(figsize=(14,6))
    ax1 = fig.add_subplot(1,2,1)
    ax2 = fig.add_subplot(1,2,2)
    sns.distplot(feature, norm_hist = True, ax = ax1)
    sns.boxplot(feature, orient = 'v', ax = ax2)
    
# Reusable method to plot scatter plot (For Bivariate Analysis)
def scatterPlot(y, x):
    plt.figure(figsize=(10,8))
    sns.scatterplot(x = x, y = y)
Univariate Analysis

Let's try to do some univariate analysis

In [46]:
sns.countplot(telecom_filtered.churn)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x8e8b10>

This clearly shows that there is high class imbalance between the churn and non-churn as amount of churning customers is very less in comparison to the non-churning ones.

In [47]:
distributionPlot(telecom_filtered.arpu_6)

Observations

The field arpu_6 outliers due to which the distribution graph is also skewedwhich means it's mean & median would be greater than it's mode.

Let's check a few more fields

In [48]:
distributionPlot(telecom_filtered.total_rech_amt_8)

It is weird that there are some transactions crossing the 40000 mark. It might be a wrong data entry or someone with very high usage of phone.

Let's check the age on numbers distribution and see how the customers are distributed to different life-spans.

In [49]:
distributionPlot(telecom_filtered.onnet_mou_8)
Bivariate Analysis

Let's try to do some Bivariate analysis

In [50]:
help(scatterPlot)
Help on function scatterPlot in module __main__:

scatterPlot(y, x)
    # Reusable method to plot scatter plot (For Bivariate Analysis)

In [51]:
# lets now draw a scatter plot between total recharge and avg revenue for the 8th month
scatterPlot(telecom_filtered.total_rech_num_8, telecom_filtered.arpu_8)
Churn Vs important predictors
In [52]:
sns.boxplot(x = telecom_filtered.churn, y = telecom_filtered.sep_vbc_3g)
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0xb728a10>
In [53]:
sns.boxplot(x = telecom_filtered.churn, y = telecom_filtered.aon)
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0xb8bcc10>
In [54]:
sns.boxplot(x = telecom_filtered.churn, y = telecom_filtered.spl_ic_mou_diff)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0xb901f70>
In [55]:
sns.boxplot(x = telecom_filtered.churn, y = telecom_filtered.total_ic_mou_diff)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0xb9487d0>
In [56]:
sns.boxplot(x = telecom_filtered.churn, y = telecom_filtered.total_og_mou_8)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0xb9884b0>
In [57]:
sns.boxplot(x = telecom_filtered.churn, y = np.log(telecom_filtered.max_rech_amt_8))
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0xb9d6530>

Oberservations

  • It is evident that the amount spend on recharge in the action phase is less.
  • The graph shows the non-churning customers spend equally likely in the action as well as good phase.

This means that when customers are about to churn they reduce the amount spend on the recharge as they might just want to cope up to the time they churn out.

In [58]:
# churn vs max recharge amount
plt.figure(figsize=(12,4))
ax = sns.kdeplot(telecom_filtered.max_rech_amt_8[(telecom_filtered["churn"] == 0)],
                color="Red", shade = True)
ax = sns.kdeplot(telecom_filtered.max_rech_amt_8[(telecom_filtered["churn"] == 1)],
                ax =ax, color="Blue", shade= True)
In [63]:
# create function to anlyze the features across 6th, 7th and 8th month
def analyze_feature(feature_start):
    plt.rcParams["figure.figsize"] = [17, 7]
    fig = plt.figure()
    print("Churn Stats (mean and standard deviation):")
    cols = {c: feature_start + "_" + str(c) for c in [6,7,8]}
    print(cols)
    frame_data = []
    [frame_data.append({
        "mean_6": telecom_filtered[telecom_filtered["churn"] == churn][cols[6]].mean(),
        "mean_7": telecom_filtered[telecom_filtered["churn"] == churn][cols[7]].mean(),
        "mean_8": telecom_filtered[telecom_filtered["churn"] == churn][cols[8]].mean(),
        "std_6": telecom_filtered[telecom_filtered["churn"] == churn][cols[6]].std(),
        "std_7": telecom_filtered[telecom_filtered["churn"] == churn][cols[7]].std(),
        "std_8": telecom_filtered[telecom_filtered["churn"] == churn][cols[8]].std()
    }) for churn in [0,1]]
    
    f,axes = plt.subplots(nrows=1, ncols=3)
    sns.boxplot(data=telecom_filtered, y=feature_start + "_6", x="churn",
                hue="churn", linewidth=2.5, showfliers=False, ax=axes[0])
    sns.boxplot(data=telecom_filtered, y=feature_start + "_7",  x="churn",
                hue="churn", linewidth=2.5, showfliers=False, ax=axes[1])
    sns.boxplot(data=telecom_filtered, y=feature_start + "_8",  x="churn",
                hue="churn", linewidth=2.5, showfliers=False, ax=axes[2])
    return pd.DataFrame(frame_data,index={"Non Churned","Churned"}).round(2)
In [64]:
analyze_feature('total_rech_amt')
Churn Stats (mean and standard deviation):
{6: 'total_rech_amt_6', 7: 'total_rech_amt_7', 8: 'total_rech_amt_8'}
Out[64]:
mean_6 mean_7 mean_8 std_6 std_7 std_8
Churned 649.57 663.53 613.70 547.70 574.16 614.83
Non Churned 786.71 626.29 253.15 681.56 667.94 452.36
<Figure size 1224x504 with 0 Axes>
In [67]:
recharge_amnt_columns =  telecom_filtered.columns[telecom_filtered.columns.str.contains('rech_amt')]

recharge_amnt_columns
Out[67]:
Index(['total_rech_amt_6', 'total_rech_amt_7', 'total_rech_amt_8',
       'max_rech_amt_6', 'max_rech_amt_7', 'max_rech_amt_8',
       'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8',
       'total_rech_amt_diff', 'max_rech_amt_diff', 'av_rech_amt_data_diff'],
      dtype='object')
In [68]:
# let's analyze total recharge amount for data:
analyze_feature('max_rech_amt')
Churn Stats (mean and standard deviation):
{6: 'max_rech_amt_6', 7: 'max_rech_amt_7', 8: 'max_rech_amt_8'}
Out[68]:
mean_6 mean_7 mean_8 std_6 std_7 std_8
Churned 169.05 173.34 166.77 172.00 176.75 171.34
Non Churned 171.94 159.39 85.37 209.12 226.97 176.65
<Figure size 1224x504 with 0 Axes>
In [69]:
pd.crosstab(telecom_filtered.churn, telecom_filtered.night_pck_user_8, normalize='columns')*100
Out[69]:
night_pck_user_8 -1.0 0.0 1.0
churn
0 85.89123 97.117602 97.360704
1 14.10877 2.882398 2.639296
In [70]:
pd.crosstab(telecom_filtered.churn, telecom_filtered.sachet_3g_8)
Out[70]:
sachet_3g_8 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 25 27 29 30 38 41
churn
0 24972 1609 399 184 106 86 43 35 28 19 15 8 11 10 6 6 2 2 3 1 3 3 2 1 1 1 2 1 1
1 2369 48 5 8 4 2 1 0 2 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
In [73]:
def cap_outliers(array, k=3):
    upper_limit = array.mean() + k*array.std()
    lower_limit = array.mean() - k*array.std()
    array[array<lower_limit] = lower_limit
    array[array>upper_limit] = upper_limit
    return array
In [74]:
# example of capping
sample_array = list(range(100))

# add outliers to the data
sample_array[0] = -9999
sample_array[99] = 9999

# cap outliers
sample_array = np.array(sample_array)
print("Array after capping outliers: \n", cap_outliers(sample_array, k=2))
Array after capping outliers: 
 [-2780     1     2     3     4     5     6     7     8     9    10    11
    12    13    14    15    16    17    18    19    20    21    22    23
    24    25    26    27    28    29    30    31    32    33    34    35
    36    37    38    39    40    41    42    43    44    45    46    47
    48    49    50    51    52    53    54    55    56    57    58    59
    60    61    62    63    64    65    66    67    68    69    70    71
    72    73    74    75    76    77    78    79    80    81    82    83
    84    85    86    87    88    89    90    91    92    93    94    95
    96    97    98  2877]
In [75]:
# cap outliers in the numeric columns
telecom_filtered[num_cols] = telecom_filtered[num_cols].apply(cap_outliers, axis=0)
In [76]:
telecom_filtered.describe(percentiles=[0.01, 0.10,.25,.5,.75,.90,.95,.99])
Out[76]:
loc_og_t2o_mou std_og_t2o_mou loc_ic_t2o_mou arpu_6 arpu_7 arpu_8 onnet_mou_6 onnet_mou_7 onnet_mou_8 offnet_mou_6 offnet_mou_7 offnet_mou_8 roam_ic_mou_6 roam_ic_mou_7 roam_ic_mou_8 roam_og_mou_6 roam_og_mou_7 roam_og_mou_8 loc_og_t2t_mou_6 loc_og_t2t_mou_7 loc_og_t2t_mou_8 loc_og_t2m_mou_6 loc_og_t2m_mou_7 loc_og_t2m_mou_8 loc_og_t2f_mou_6 loc_og_t2f_mou_7 loc_og_t2f_mou_8 loc_og_t2c_mou_6 loc_og_t2c_mou_7 loc_og_t2c_mou_8 loc_og_mou_6 loc_og_mou_7 loc_og_mou_8 std_og_t2t_mou_6 std_og_t2t_mou_7 std_og_t2t_mou_8 std_og_t2m_mou_6 std_og_t2m_mou_7 std_og_t2m_mou_8 std_og_t2f_mou_6 std_og_t2f_mou_7 std_og_t2f_mou_8 std_og_t2c_mou_6 std_og_t2c_mou_7 std_og_t2c_mou_8 std_og_mou_6 std_og_mou_7 std_og_mou_8 isd_og_mou_6 isd_og_mou_7 isd_og_mou_8 spl_og_mou_6 spl_og_mou_7 spl_og_mou_8 og_others_6 og_others_7 og_others_8 total_og_mou_6 total_og_mou_7 total_og_mou_8 loc_ic_t2t_mou_6 loc_ic_t2t_mou_7 loc_ic_t2t_mou_8 loc_ic_t2m_mou_6 loc_ic_t2m_mou_7 loc_ic_t2m_mou_8 loc_ic_t2f_mou_6 loc_ic_t2f_mou_7 loc_ic_t2f_mou_8 loc_ic_mou_6 loc_ic_mou_7 loc_ic_mou_8 std_ic_t2t_mou_6 std_ic_t2t_mou_7 std_ic_t2t_mou_8 std_ic_t2m_mou_6 std_ic_t2m_mou_7 std_ic_t2m_mou_8 std_ic_t2f_mou_6 std_ic_t2f_mou_7 std_ic_t2f_mou_8 std_ic_t2o_mou_6 std_ic_t2o_mou_7 std_ic_t2o_mou_8 std_ic_mou_6 std_ic_mou_7 std_ic_mou_8 total_ic_mou_6 total_ic_mou_7 total_ic_mou_8 spl_ic_mou_6 spl_ic_mou_7 spl_ic_mou_8 isd_ic_mou_6 isd_ic_mou_7 isd_ic_mou_8 ic_others_6 ic_others_7 ic_others_8 total_rech_num_6 total_rech_num_7 total_rech_num_8 total_rech_amt_6 total_rech_amt_7 total_rech_amt_8 max_rech_amt_6 max_rech_amt_7 max_rech_amt_8 last_day_rch_amt_6 last_day_rch_amt_7 last_day_rch_amt_8 total_rech_data_6 total_rech_data_7 total_rech_data_8 max_rech_data_6 max_rech_data_7 max_rech_data_8 av_rech_amt_data_6 av_rech_amt_data_7 av_rech_amt_data_8 vol_2g_mb_6 vol_2g_mb_7 vol_2g_mb_8 vol_3g_mb_6 vol_3g_mb_7 vol_3g_mb_8 monthly_2g_6 monthly_2g_7 monthly_2g_8 sachet_2g_6 sachet_2g_7 sachet_2g_8 monthly_3g_6 monthly_3g_7 monthly_3g_8 sachet_3g_6 sachet_3g_7 sachet_3g_8 aon aug_vbc_3g jul_vbc_3g jun_vbc_3g sep_vbc_3g arpu_diff onnet_mou_diff offnet_mou_diff roam_ic_mou_diff roam_og_mou_diff loc_og_mou_diff std_og_mou_diff isd_og_mou_diff spl_og_mou_diff total_og_mou_diff loc_ic_mou_diff std_ic_mou_diff isd_ic_mou_diff spl_ic_mou_diff total_ic_mou_diff total_rech_num_diff total_rech_amt_diff max_rech_amt_diff total_rech_data_diff max_rech_data_diff av_rech_amt_data_diff vol_2g_mb_diff vol_3g_mb_diff
count 30001.0 30001.0 30001.0 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.0 30001.0 30001.0 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.0 30001.0 30001.0 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000 30001.000000
mean 0.0 0.0 0.0 548.190437 550.814573 498.441171 244.520332 251.243475 216.618419 358.669154 362.067463 319.430161 12.324453 9.182973 9.218402 20.590115 14.991576 14.900044 73.430226 74.173410 67.990023 154.589624 155.166695 143.999326 5.467283 5.550102 5.061197 1.307066 1.588277 1.423773 241.055646 242.367797 223.979676 150.313802 158.568321 131.560295 162.589594 170.121749 140.224822 1.258933 1.251773 1.090205 0.0 0.0 0.0 329.829118 345.583610 287.333744 1.295685 1.256724 1.087310 5.148315 6.358940 5.771581 0.607632 0.012704 0.016199 593.477978 611.006104 533.001799 58.316939 59.012023 55.273497 142.381184 143.693593 138.688508 13.251308 13.891375 12.483237 220.221370 223.079516 212.700800 12.005568 12.647887 11.169672 24.955743 26.035983 23.654207 2.042620 2.135998 1.953896 0.0 0.0 0.0 40.718255 42.555436 38.246218 277.926328 283.954653 268.064438 0.058812 0.016700 0.020953 7.817988 8.606003 7.705977 0.819634 1.032470 0.855009 11.606818 11.470941 9.738043 648.062189 648.223689 572.185910 162.247357 164.681944 153.829721 95.253705 94.876863 87.397390 1.578785 1.764332 1.522783 75.323064 77.071625 69.849879 125.275839 133.671573 116.133092 113.086612 110.827412 92.308197 311.305862 334.677696 316.753516 0.162925 0.170937 0.143327 0.976537 1.127116 0.961168 0.196946 0.205904 0.189431 0.162736 0.174138 0.148882 1209.265827 152.768184 161.311677 141.546609 5.496166 -52.328322 -29.290383 -39.933045 -2.082643 -3.718059 -18.039846 -47.314995 -0.170902 -0.085567 -67.192323 -9.080333 -3.626018 -0.380302 -0.013693 -12.917269 -1.773654 -77.372231 -9.033386 -0.150244 -6.306859 -13.323290 -20.329083 -9.372739
std 0.0 0.0 0.0 356.617955 361.120770 384.493319 371.616516 387.608598 354.670696 405.252913 413.547205 386.603835 41.750280 35.649995 35.301200 66.900217 52.507665 53.593739 133.710618 136.275589 126.690738 199.750626 197.186851 190.418703 12.760515 12.820832 11.753345 3.693101 4.580213 4.063138 303.277239 300.899122 290.389270 315.832519 331.879247 296.808431 314.006621 330.471067 291.213051 5.203727 5.317938 4.640543 0.0 0.0 0.0 524.488985 548.223773 498.961062 10.699686 10.317350 9.871402 10.600781 12.550877 11.636151 1.697840 0.269649 0.328506 613.798254 633.909173 602.715259 93.295904 94.162013 89.899924 169.009913 167.814113 167.897746 27.962660 29.651869 26.368381 249.910673 250.191627 247.665389 32.766539 34.735061 31.399101 52.663444 55.417836 52.467314 7.624020 7.869478 7.421909 0.0 0.0 0.0 78.564475 82.766840 77.159409 296.692217 300.818719 295.703932 0.136661 0.075829 0.075179 31.468912 33.414192 30.225982 3.914676 4.746921 3.934353 8.098621 8.234645 8.112399 430.636601 441.268660 463.024954 130.747590 134.287275 129.472802 103.752260 104.723447 105.408310 2.353920 2.602330 2.402273 98.306076 99.433675 96.883608 164.230463 171.083025 162.313141 242.554523 240.206372 216.560879 631.276715 653.840360 646.002538 0.387603 0.400701 0.360264 2.073438 2.315567 2.066318 0.492166 0.515368 0.490119 0.537111 0.589553 0.557659 957.361979 321.412159 339.112777 310.332338 24.137022 287.305975 233.050772 270.164355 37.609989 57.643102 163.858439 337.195350 4.593713 11.614253 424.612747 137.043194 55.749839 24.280635 0.117280 172.748962 5.868877 361.299079 101.208408 1.818767 68.254102 126.482299 187.566599 453.816505
min 0.0 0.0 0.0 -823.430558 -878.548639 -945.808000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 180.000000 0.000000 0.000000 0.000000 0.000000 -1055.655521 -956.391357 -1089.126198 -199.792064 -285.992824 -660.171120 -1332.033994 -42.193478 -54.170733 -1575.327545 -551.869803 -298.284232 -170.540976 -0.537392 -679.130516 -22.079752 -1331.687092 -421.277866 -6.782528 -241.497690 -457.835482 -788.241461 -1880.366300
1% 0.0 0.0 0.0 1.000000 0.700000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 188.000000 0.000000 0.000000 0.000000 0.000000 -956.267000 -956.391357 -1089.126198 -179.935000 -285.992824 -660.171120 -1332.033994 -14.410000 -38.590000 -1575.327545 -551.610000 -239.330000 -93.005000 -0.380000 -679.130516 -22.079752 -1206.000000 -421.277866 -6.782528 -241.497690 -449.000000 -788.241461 -1727.860000
10% 0.0 0.0 0.0 171.605000 177.886000 84.000000 0.700000 0.580000 0.000000 11.260000 10.430000 2.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.380000 2.640000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 10.760000 10.080000 0.210000 0.280000 0.460000 0.000000 5.160000 6.690000 1.350000 0.000000 0.000000 0.000000 10.040000 12.990000 3.030000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 18.190000 22.210000 6.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 4.000000 2.000000 195.000000 192.000000 56.000000 37.000000 36.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 284.000000 0.000000 0.000000 0.000000 0.000000 -385.531000 -215.110000 -291.760000 -14.525000 -28.340000 -171.035000 -354.895000 0.000000 -9.415000 -523.835000 -134.450000 -39.275000 -5.130000 -0.130000 -174.290000 -8.000000 -506.000000 -109.500000 -2.000000 -88.500000 -152.000000 -168.225000 -383.960000
25% 0.0 0.0 0.0 309.865000 309.826000 231.473000 17.080000 16.030000 10.390000 71.610000 69.910000 46.740000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.380000 4.610000 2.530000 19.210000 20.460000 13.090000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 32.260000 34.480000 22.460000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.380000 0.150000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 116.580000 116.490000 72.360000 6.090000 6.780000 4.880000 28.080000 30.780000 24.210000 0.000000 0.000000 0.000000 47.880000 52.510000 40.530000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 1.030000 1.050000 0.410000 72.880000 78.990000 61.360000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 6.000000 4.000000 360.000000 352.000000 250.000000 110.000000 110.000000 50.000000 25.000000 20.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 460.000000 0.000000 0.000000 0.000000 0.000000 -180.179500 -52.345000 -100.900000 0.000000 0.000000 -58.880000 -54.595000 0.000000 -2.750000 -168.025000 -49.805000 -11.905000 0.000000 0.000000 -67.230000 -4.000000 -245.000000 -33.000000 -0.500000 -7.000000 -59.000000 -21.120000 -2.270000
50% 0.0 0.0 0.0 481.694000 480.943000 427.585000 84.580000 82.810000 65.610000 222.540000 220.030000 182.790000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 24.330000 24.680000 20.730000 77.180000 78.880000 67.760000 0.000000 0.100000 0.000000 0.000000 0.000000 0.000000 124.540000 127.880000 110.810000 2.710000 2.680000 0.630000 16.360000 16.040000 9.460000 0.000000 0.000000 0.000000 0.0 0.0 0.0 45.130000 43.980000 25.480000 0.000000 0.000000 0.000000 0.210000 0.710000 0.430000 0.000000 0.000000 0.000000 410.630000 425.590000 333.540000 25.080000 25.780000 22.610000 84.060000 85.660000 80.040000 1.840000 2.000000 1.680000 134.490000 137.590000 128.730000 0.380000 0.450000 0.110000 5.000000 5.180000 3.710000 0.000000 0.000000 0.000000 0.0 0.0 0.0 11.410000 12.090000 9.290000 183.730000 187.580000 173.110000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 9.000000 8.000000 566.000000 562.000000 490.000000 120.000000 128.000000 130.000000 67.000000 50.000000 50.000000 1.000000 1.000000 1.000000 25.000000 25.000000 23.000000 50.000000 69.000000 25.000000 0.000000 0.050000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 846.000000 0.000000 0.000000 0.000000 0.000000 -40.754500 -3.210000 -11.025000 0.000000 0.000000 -4.955000 -0.405000 0.000000 0.000000 -14.625000 -3.940000 -0.640000 0.000000 0.000000 -5.610000 -1.500000 -57.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.0 0.0 0.0 699.943000 698.315000 661.491000 290.440000 290.240000 239.960000 487.940000 494.010000 438.890000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 77.980000 78.340000 72.040000 208.930000 208.410000 196.960000 4.230000 4.380000 3.930000 0.000000 0.100000 0.010000 332.090000 333.390000 310.610000 109.610000 116.610000 78.190000 154.540000 160.280000 117.780000 0.000000 0.000000 0.000000 0.0 0.0 0.0 465.390000 499.290000 357.230000 0.000000 0.000000 0.000000 5.210000 6.940000 6.150000 0.000000 0.000000 0.000000 858.760000 891.440000 774.780000 67.430000 68.640000 65.080000 188.710000 190.660000 185.730000 11.980000 12.240000 11.160000 297.140000 298.710000 285.440000 8.200000 8.590000 7.040000 23.600000 24.940000 21.810000 0.110000 0.200000 0.080000 0.0 0.0 0.0 41.490000 43.810000 38.380000 371.990000 377.310000 361.730000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.050000 0.000000 0.000000 15.000000 15.000000 13.000000 834.000000 831.000000 776.000000 200.000000 200.000000 198.000000 120.000000 130.000000 130.000000 2.000000 2.000000 2.000000 154.000000 154.000000 154.000000 198.000000 210.000000 196.000000 83.450000 77.770000 51.650000 358.740000 410.700000 356.190000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1756.000000 128.800000 137.290000 98.440000 0.000000 83.239000 18.410000 41.040000 0.000000 0.000000 30.125000 8.105000 0.000000 1.310000 67.915000 34.405000 4.995000 0.000000 0.000000 44.280000 1.000000 99.500000 25.000000 0.000000 0.000000 9.500000 0.000000 0.000000
90% 0.0 0.0 0.0 994.099000 995.859000 977.345000 754.160000 784.480000 665.080000 895.830000 916.080000 823.680000 27.390000 14.290000 15.010000 50.430000 31.090000 28.880000 187.930000 190.840000 178.840000 419.480000 421.880000 398.510000 16.130000 16.290000 15.160000 4.210000 5.010000 4.580000 639.290000 636.210000 607.830000 567.060000 599.340000 475.360000 549.740000 586.660000 464.890000 1.660000 1.650000 1.280000 0.0 0.0 0.0 1114.810000 1153.580000 1002.860000 0.000000 0.000000 0.000000 15.940000 19.260000 17.610000 2.060000 0.000000 0.000000 1447.080000 1481.160000 1359.590000 147.860000 149.480000 140.460000 355.040000 353.440000 352.060000 38.010000 39.940000 36.610000 544.030000 545.960000 537.060000 31.210000 32.340000 28.240000 67.540000 69.480000 62.940000 4.030000 4.330000 3.780000 0.0 0.0 0.0 107.340000 109.930000 99.980000 665.110000 673.230000 650.230000 0.250000 0.000000 0.000000 13.360000 15.860000 13.550000 1.080000 1.480000 1.130000 23.000000 23.000000 21.000000 1181.000000 1194.000000 1148.000000 274.000000 274.000000 252.000000 250.000000 250.000000 225.000000 5.000000 5.000000 5.000000 252.000000 252.000000 252.000000 308.000000 344.000000 308.000000 409.790000 399.920000 318.950000 1089.230000 1171.250000 1105.170000 1.000000 1.000000 1.000000 4.000000 5.000000 4.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 2793.000000 580.680000 611.310000 532.610000 0.900000 247.144000 112.520000 171.550000 4.015000 6.895000 119.545000 159.880000 0.000000 9.480000 291.035000 111.930000 28.610000 2.025000 0.000000 141.720000 4.500000 301.000000 71.000000 1.500000 50.000000 116.000000 68.235000 310.945000
95% 0.0 0.0 0.0 1240.964000 1261.272000 1255.019000 1135.440000 1185.790000 1074.590000 1256.610000 1272.290000 1167.540000 84.540000 55.640000 56.350000 145.410000 104.240000 100.510000 322.740000 324.390000 298.780000 604.380000 600.460000 574.760000 30.430000 31.130000 28.390000 8.880000 10.190000 9.380000 913.490000 905.660000 859.780000 965.530000 1011.680000 887.460000 931.910000 976.560000 842.830000 6.830000 6.680000 5.930000 0.0 0.0 0.0 1552.330000 1614.260000 1476.160000 0.810000 0.700000 0.360000 26.640000 30.990000 29.280000 3.890000 0.000000 0.000000 1913.480000 1951.760000 1857.080000 234.180000 233.940000 224.330000 507.690000 510.360000 509.810000 71.560000 73.340000 65.310000 759.810000 771.960000 758.560000 60.280000 62.090000 56.280000 120.430000 122.090000 114.080000 11.180000 11.530000 10.490000 0.0 0.0 0.0 185.940000 194.190000 181.080000 922.080000 929.310000 915.690000 0.410000 0.110000 0.250000 42.080000 46.750000 42.310000 3.410000 4.590000 3.730000 29.000000 29.000000 28.000000 1480.000000 1510.000000 1490.000000 459.000000 459.000000 440.000000 252.000000 252.000000 252.000000 7.000000 8.000000 7.000000 252.000000 252.000000 252.000000 495.000000 504.000000 495.000000 702.290000 690.770000 613.020000 1714.340000 1785.990000 1736.240000 1.000000 1.000000 1.000000 6.000000 7.000000 6.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3140.000000 938.560000 984.930000 885.240000 31.310000 392.448000 258.205000 329.705000 26.210000 50.140000 213.780000 383.110000 0.000000 17.485000 539.335000 193.090000 59.315000 14.440000 0.150000 240.830000 7.500000 480.000000 116.500000 3.000000 103.000000 203.000000 207.510000 701.370000
99% 0.0 0.0 0.0 1940.412206 1999.500000 1986.622000 1639.726127 1707.801260 1609.458333 1821.264635 1874.873834 1781.264570 245.016822 240.000212 234.876394 375.188122 308.453369 334.022642 770.866765 807.252285 760.197938 913.593397 887.740079 852.739528 72.714799 72.256232 64.393228 20.485258 27.980000 23.332161 1389.097571 1381.373228 1325.838689 1377.381192 1437.804502 1341.134816 1406.510831 1488.817198 1384.650920 37.741836 39.640000 34.380144 0.0 0.0 0.0 2173.013101 2283.382515 2134.308828 38.280000 37.230000 28.690000 58.519980 72.473557 67.492774 8.890000 0.000000 0.000000 2682.486542 2780.023559 2653.113353 556.669926 568.848413 539.356311 808.009301 806.354991 790.880604 152.534278 163.533482 144.318747 1175.617364 1182.292626 1148.210139 208.410000 223.040000 214.580000 323.826647 347.816015 337.948080 51.380000 53.490000 45.930000 0.0 0.0 0.0 459.293842 493.529461 460.841569 1386.520920 1417.998684 1363.448456 0.625209 0.490000 0.358659 224.260000 235.330000 222.800166 19.980000 25.740000 21.510000 40.138401 40.016098 37.795976 2343.830937 2407.790643 2341.000000 695.232696 716.373953 679.767860 532.106319 523.133029 527.940627 10.161410 11.129334 10.609944 416.273968 422.661540 404.195475 731.638486 760.208022 708.893089 1118.114565 1119.741356 1046.067585 3085.501901 3116.774434 3108.557167 1.479353 1.534530 1.373086 8.587655 9.422499 8.662684 2.057355 2.206400 2.063672 3.166415 3.497280 3.502715 3651.000000 1432.061922 1509.505179 1408.728543 165.164485 856.402000 861.070000 877.245000 176.080000 250.680000 567.955000 1184.435000 7.080000 47.240000 1400.040000 507.295000 220.275000 89.780000 0.500000 621.950000 16.500000 1021.500000 323.000000 6.500000 227.000000 431.663891 748.448094 1868.899027
max 0.0 0.0 0.0 1940.412206 2000.113045 2014.483901 1639.726127 1707.801260 1609.458333 1821.264635 1874.873834 1781.264570 245.016822 240.000212 234.876394 375.188122 308.453369 334.022642 770.866765 807.252285 760.197938 913.593397 887.740079 852.739528 72.714799 72.256232 64.393228 20.485258 29.366826 23.332161 1389.097571 1381.373228 1325.838689 1377.381192 1437.804502 1341.134816 1406.510831 1488.817198 1384.650920 37.741836 40.766857 34.380144 0.0 0.0 0.0 2173.013101 2283.382515 2134.308828 139.728096 138.741424 136.085042 58.519980 72.473557 67.492774 12.984971 8.073286 10.002983 2682.486542 2780.023559 2653.113353 556.669926 568.848413 539.356311 808.009301 806.354991 790.880604 152.534278 163.533482 144.318747 1175.617364 1182.292626 1148.210139 247.981822 264.549695 227.929271 323.826647 347.816015 337.948080 59.995636 61.341956 62.252784 0.0 0.0 0.0 459.293842 493.529461 460.841569 1386.520920 1417.998684 1363.448456 0.625209 0.565340 0.358659 240.972581 252.413731 222.800166 43.111252 49.492257 39.574168 40.138401 40.016098 37.795976 2343.830937 2407.790643 2418.103175 695.232696 716.373953 679.767860 532.106319 523.133029 527.940627 10.161410 11.129334 10.609944 416.273968 422.661540 404.195475 731.638486 760.208022 708.893089 1118.114565 1119.741356 1046.067585 3085.501901 3116.774434 3108.557167 1.479353 1.534530 1.373086 8.587655 9.422499 8.662684 2.057355 2.206400 2.063672 3.166415 3.497280 3.502715 4082.402722 1432.061922 1509.505179 1408.728543 165.164485 953.578407 896.004117 1007.483157 196.040309 279.000421 622.430788 1235.682161 41.814149 54.358224 1440.452871 532.937973 291.251991 169.877005 0.509726 652.268895 18.455640 1179.187642 400.080239 6.504837 228.562154 431.663891 748.448094 1868.899027

If we look the pattern followed by the customers on spending in the two categories churning and non-churning, it is clearly visible that both of them follow every different types of spending and such behaviour can be an indicator of churning vs non-churning.

Splitting the Data into Training and Testing Sets

Let's start the model builing but before that, let's divide our dataset into traning and testing set.

In [77]:
#Let's separate the predicted and predictor variables
X = telecom_filtered.drop(['churn'], axis=1)
y = telecom_filtered['churn']
In [78]:
# Let us scale our dataset as the units are very varying
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
In [79]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
Shape of X_train: (21000, 172)
Shape of X_test: (9001, 172)

Model Building and Evaluation

In this step we would create multiple models and try to find the best model among them. Also, we would use PCA for dimensionality reduction

In [228]:
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn import tree

PCA

In [113]:
(telecom_filtered.isnull().sum()*100/telecom_filtered.shape[0]).sort_values(ascending=False)
Out[113]:
vol_3g_mb_diff           0.0
spl_og_mou_8             0.0
loc_ic_t2t_mou_7         0.0
loc_ic_t2t_mou_6         0.0
total_og_mou_8           0.0
total_og_mou_7           0.0
total_og_mou_6           0.0
og_others_8              0.0
og_others_7              0.0
og_others_6              0.0
spl_og_mou_7             0.0
loc_ic_t2m_mou_6         0.0
spl_og_mou_6             0.0
isd_og_mou_8             0.0
isd_og_mou_7             0.0
isd_og_mou_6             0.0
std_og_mou_8             0.0
std_og_mou_7             0.0
std_og_mou_6             0.0
std_og_t2c_mou_8         0.0
loc_ic_t2t_mou_8         0.0
loc_ic_t2m_mou_7         0.0
std_ic_mou_7             0.0
std_ic_t2m_mou_6         0.0
std_ic_t2o_mou_8         0.0
std_ic_t2o_mou_7         0.0
std_ic_t2o_mou_6         0.0
std_ic_t2f_mou_8         0.0
std_ic_t2f_mou_7         0.0
std_ic_t2f_mou_6         0.0
std_ic_t2m_mou_8         0.0
std_ic_t2m_mou_7         0.0
std_ic_t2t_mou_8         0.0
loc_ic_t2m_mou_8         0.0
std_ic_t2t_mou_7         0.0
std_ic_t2t_mou_6         0.0
loc_ic_mou_8             0.0
loc_ic_mou_7             0.0
loc_ic_mou_6             0.0
loc_ic_t2f_mou_8         0.0
loc_ic_t2f_mou_7         0.0
loc_ic_t2f_mou_6         0.0
std_og_t2c_mou_7         0.0
std_og_t2c_mou_6         0.0
std_og_t2f_mou_8         0.0
offnet_mou_7             0.0
loc_og_t2t_mou_6         0.0
roam_og_mou_8            0.0
roam_og_mou_7            0.0
roam_og_mou_6            0.0
roam_ic_mou_8            0.0
roam_ic_mou_7            0.0
roam_ic_mou_6            0.0
offnet_mou_8             0.0
offnet_mou_6             0.0
std_og_t2f_mou_7         0.0
onnet_mou_8              0.0
onnet_mou_7              0.0
onnet_mou_6              0.0
arpu_8                   0.0
arpu_7                   0.0
arpu_6                   0.0
loc_ic_t2o_mou           0.0
std_og_t2o_mou           0.0
loc_og_t2t_mou_7         0.0
loc_og_t2t_mou_8         0.0
loc_og_t2m_mou_6         0.0
loc_og_t2m_mou_7         0.0
std_og_t2f_mou_6         0.0
std_og_t2m_mou_8         0.0
std_og_t2m_mou_7         0.0
std_og_t2m_mou_6         0.0
std_og_t2t_mou_8         0.0
std_og_t2t_mou_7         0.0
std_og_t2t_mou_6         0.0
loc_og_mou_8             0.0
loc_og_mou_7             0.0
loc_og_mou_6             0.0
loc_og_t2c_mou_8         0.0
loc_og_t2c_mou_7         0.0
loc_og_t2c_mou_6         0.0
loc_og_t2f_mou_8         0.0
loc_og_t2f_mou_7         0.0
loc_og_t2f_mou_6         0.0
loc_og_t2m_mou_8         0.0
std_ic_mou_6             0.0
std_ic_mou_8             0.0
vol_2g_mb_diff           0.0
sachet_3g_8              0.0
sep_vbc_3g               0.0
jun_vbc_3g               0.0
jul_vbc_3g               0.0
aug_vbc_3g               0.0
aon                      0.0
fb_user_8                0.0
fb_user_7                0.0
fb_user_6                0.0
sachet_3g_7              0.0
arpu_diff                0.0
sachet_3g_6              0.0
monthly_3g_8             0.0
monthly_3g_7             0.0
monthly_3g_6             0.0
sachet_2g_8              0.0
sachet_2g_7              0.0
sachet_2g_6              0.0
monthly_2g_8             0.0
churn                    0.0
onnet_mou_diff           0.0
total_ic_mou_6           0.0
isd_ic_mou_diff          0.0
av_rech_amt_data_diff    0.0
max_rech_data_diff       0.0
total_rech_data_diff     0.0
max_rech_amt_diff        0.0
total_rech_amt_diff      0.0
total_rech_num_diff      0.0
total_ic_mou_diff        0.0
spl_ic_mou_diff          0.0
std_ic_mou_diff          0.0
offnet_mou_diff          0.0
loc_ic_mou_diff          0.0
total_og_mou_diff        0.0
spl_og_mou_diff          0.0
isd_og_mou_diff          0.0
std_og_mou_diff          0.0
loc_og_mou_diff          0.0
roam_og_mou_diff         0.0
roam_ic_mou_diff         0.0
monthly_2g_7             0.0
monthly_2g_6             0.0
night_pck_user_8         0.0
ic_others_7              0.0
max_rech_amt_6           0.0
total_rech_amt_8         0.0
total_rech_amt_7         0.0
total_rech_amt_6         0.0
total_rech_num_8         0.0
total_rech_num_7         0.0
total_rech_num_6         0.0
ic_others_8              0.0
ic_others_6              0.0
night_pck_user_7         0.0
isd_ic_mou_8             0.0
isd_ic_mou_7             0.0
isd_ic_mou_6             0.0
spl_ic_mou_8             0.0
spl_ic_mou_7             0.0
spl_ic_mou_6             0.0
total_ic_mou_8           0.0
total_ic_mou_7           0.0
max_rech_amt_7           0.0
max_rech_amt_8           0.0
last_day_rch_amt_6       0.0
last_day_rch_amt_7       0.0
night_pck_user_6         0.0
vol_3g_mb_8              0.0
vol_3g_mb_7              0.0
vol_3g_mb_6              0.0
vol_2g_mb_8              0.0
vol_2g_mb_7              0.0
vol_2g_mb_6              0.0
av_rech_amt_data_8       0.0
av_rech_amt_data_7       0.0
av_rech_amt_data_6       0.0
max_rech_data_8          0.0
max_rech_data_7          0.0
max_rech_data_6          0.0
total_rech_data_8        0.0
total_rech_data_7        0.0
total_rech_data_6        0.0
last_day_rch_amt_8       0.0
loc_og_t2o_mou           0.0
dtype: float64
In [114]:
#Getting PCA object
pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA())])
In [115]:
pca.fit(X_train)
churn_pca = pca.fit_transform(X_train)
In [116]:
# extract pca model from pipeline
pca = pca.named_steps['pca']

# look at explainded variance of PCA components
print(pd.Series(np.round(pca.explained_variance_ratio_.cumsum(), 4)*100))
0       11.57
1       20.96
2       26.76
3       31.68
4       35.78
5       39.46
6       42.21
7       44.79
8       47.10
9       49.23
10      51.24
11      53.16
12      54.97
13      56.68
14      58.28
15      59.74
16      61.11
17      62.45
18      63.78
19      65.08
20      66.28
21      67.46
22      68.57
23      69.65
24      70.68
25      71.65
26      72.61
27      73.52
28      74.42
29      75.29
30      76.16
31      76.98
32      77.79
33      78.57
34      79.32
35      80.05
36      80.77
37      81.48
38      82.14
39      82.78
40      83.41
41      83.98
42      84.52
43      85.05
44      85.58
45      86.09
46      86.58
47      87.06
48      87.52
49      87.96
50      88.37
51      88.75
52      89.12
53      89.48
54      89.84
55      90.18
56      90.52
57      90.85
58      91.17
59      91.49
60      91.79
61      92.09
62      92.38
63      92.66
64      92.93
65      93.20
66      93.46
67      93.71
68      93.95
69      94.19
70      94.41
71      94.63
72      94.83
73      95.04
74      95.23
75      95.42
76      95.60
77      95.79
78      95.97
79      96.14
80      96.31
81      96.48
82      96.64
83      96.80
84      96.95
85      97.10
86      97.25
87      97.38
88      97.52
89      97.65
90      97.77
91      97.89
92      98.01
93      98.13
94      98.22
95      98.32
96      98.41
97      98.49
98      98.58
99      98.65
100     98.72
101     98.79
102     98.86
103     98.91
104     98.97
105     99.02
106     99.08
107     99.13
108     99.17
109     99.22
110     99.26
111     99.30
112     99.34
113     99.38
114     99.42
115     99.45
116     99.48
117     99.51
118     99.54
119     99.56
120     99.58
121     99.61
122     99.63
123     99.65
124     99.67
125     99.69
126     99.71
127     99.73
128     99.74
129     99.76
130     99.77
131     99.79
132     99.80
133     99.82
134     99.83
135     99.84
136     99.85
137     99.87
138     99.88
139     99.89
140     99.90
141     99.91
142     99.91
143     99.92
144     99.93
145     99.94
146     99.94
147     99.95
148     99.95
149     99.96
150     99.96
151     99.97
152     99.97
153     99.98
154     99.98
155     99.99
156     99.99
157     99.99
158     99.99
159    100.00
160    100.00
161    100.00
162    100.00
163    100.00
164    100.00
165    100.00
166    100.00
167    100.00
168    100.00
169    100.00
170    100.00
171    100.00
dtype: float64

We can see that 55 Principal Components are able to explain ~90% of the variance. Similarly, 75 Principal Components are able to explain ~95% of the variance.

Let's visualize this using the Scree plot

In [117]:
# Scree plot
features = range(pca.n_components_)
cumulative_variance = np.round(np.cumsum(pca.explained_variance_ratio_)*100, decimals=4)
plt.figure(figsize=[10,6])
plt.plot(cumulative_variance)
Out[117]:
[<matplotlib.lines.Line2D at 0xb0dbf30>]

Let us finalize the PC components to 55 and try it with various models such as Logistic Regression, Decision Trees etc.

PCA and Logistic Regression

Let us create a pipeline for performing the PCA followed by logistic regression.

In [118]:
# create pipeline
PCA_VARS = 55
steps = [('scaler', StandardScaler()),
         ("pca", PCA(n_components=PCA_VARS)),
         ("logistic", LogisticRegression(class_weight='balanced'))
        ]
pipeline = Pipeline(steps)
In [119]:
# fit model
pipeline.fit(X_train, y_train)

# check score on train data
pipeline.score(X_train, y_train)
Out[119]:
0.8190952380952381
In [120]:
def printScores(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

    sensitivity = cm[0,0]/(cm[0,0]+cm[0,1])
    print('Sensitivity : ', sensitivity )

    specificity = cm[1,1]/(cm[1,0]+cm[1,1])
    print('Specificity : ', specificity)
In [121]:
# predict churn on test data
y_pred = pipeline.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, y_pred_prob),2))
[[6728 1544]
 [ 118  611]]
Sensitivity :  0.8133462282398453
Specificity :  0.8381344307270233
AUC: 0.89

Note

As it is mentioned in the problem statement we need to identify the churns correctly, that means recall/sensitivity is what we want to maximize, as a result we need to have high AUC value for that.

Hyperparameter tuning - PCA and Logistic Regression

In [122]:
# class imbalance
y_train.value_counts()/y_train.shape
Out[122]:
0    0.918476
1    0.081524
Name: churn, dtype: float64
In [216]:
# PCA
pca = PCA()

# logistic regression - the class weight is used to handle class imbalance - it adjusts the cost function
logistic = LogisticRegression(class_weight={0:0.06, 1: 0.94})

# create pipeline
steps = [("scaler", StandardScaler()), 
         ("pca", pca),
         ("logistic", logistic)
        ]

# compile pipeline
pca_logistic = Pipeline(steps)

# hyperparameter space
params = {'pca__n_components': [55, 75], 'logistic__C': [0.1, 0.5, 1, 2, 3, 4, 5, 10], 'logistic__penalty': ['l1', 'l2']}

# create 5 folds
folds = 5

# create gridsearch object
model = GridSearchCV(estimator=pca_logistic, cv=folds, param_grid=params, scoring='recall', return_train_score=True, n_jobs=-1, verbose=1)
In [217]:
# fit model
model.fit(X_train, y_train)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  4.5min finished
Out[217]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('pca',
                                        PCA(copy=True, iterated_power='auto',
                                            n_components=None,
                                            random_state=None,
                                            svd_solver='auto', tol=0.0,
                                            whiten=False)),
                                       ('logistic',
                                        LogisticRegression(C=1.0,
                                                           class_weight={0: 0.06,
                                                                         1: 0.94},
                                                           dual=False...
                                                           max_iter=100,
                                                           multi_class='warn',
                                                           n_jobs=None,
                                                           penalty='l2',
                                                           random_state=None,
                                                           solver='warn',
                                                           tol=0.0001,
                                                           verbose=0,
                                                           warm_start=False))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'logistic__C': [0.1, 0.5, 1, 2, 3, 4, 5, 10],
                         'logistic__penalty': ['l1', 'l2'],
                         'pca__n_components': [55, 75]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='recall', verbose=1)
In [218]:
# cross validation results
pd.DataFrame(model.cv_results_)
Out[218]:
mean_fit_time std_fit_time mean_score_time std_score_time param_logistic__C param_logistic__penalty param_pca__n_components params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 4.444029 0.089766 0.103981 0.011539 0.1 l1 55 {'logistic__C': 0.1, 'logistic__penalty': 'l1'... 0.874636 0.860058 0.865497 0.894737 0.868421 0.872668 0.011994 18 0.883126 0.878013 0.881022 0.881752 0.883212 0.881425 0.001897
1 5.419057 0.348640 0.104986 0.005763 0.1 l1 75 {'logistic__C': 0.1, 'logistic__penalty': 'l1'... 0.886297 0.871720 0.868421 0.888889 0.862573 0.875580 0.010269 5 0.888970 0.886048 0.884672 0.887591 0.886131 0.886682 0.001470
2 4.602598 0.177276 0.103184 0.005075 0.1 l2 55 {'logistic__C': 0.1, 'logistic__penalty': 'l2'... 0.886297 0.865889 0.865497 0.894737 0.859649 0.874414 0.013596 13 0.881665 0.878744 0.877372 0.883942 0.878102 0.879965 0.002465
3 6.377885 0.217418 0.104986 0.012425 0.1 l2 75 {'logistic__C': 0.1, 'logistic__penalty': 'l2'... 0.883382 0.868805 0.871345 0.883041 0.874269 0.876168 0.006006 1 0.886779 0.886779 0.883212 0.886861 0.884672 0.885660 0.001478
4 4.247857 0.055012 0.100784 0.010491 0.5 l1 55 {'logistic__C': 0.5, 'logistic__penalty': 'l1'... 0.877551 0.854227 0.859649 0.883041 0.856725 0.866238 0.011734 28 0.878013 0.880935 0.874453 0.883212 0.879562 0.879235 0.002939
5 5.566230 0.066752 0.107381 0.015449 0.5 l1 75 {'logistic__C': 0.5, 'logistic__penalty': 'l1'... 0.877551 0.862974 0.871345 0.883041 0.874269 0.873835 0.006677 14 0.882396 0.885318 0.880292 0.883212 0.881752 0.882594 0.001664
6 4.667984 0.134979 0.095785 0.009556 0.5 l2 55 {'logistic__C': 0.5, 'logistic__penalty': 'l2'... 0.871720 0.851312 0.865497 0.883041 0.859649 0.866243 0.010762 26 0.875091 0.879474 0.875182 0.881022 0.877372 0.877628 0.002341
7 6.881599 0.205480 0.112385 0.007914 0.5 l2 75 {'logistic__C': 0.5, 'logistic__penalty': 'l2'... 0.880466 0.868805 0.871345 0.885965 0.868421 0.875000 0.007002 9 0.884587 0.883126 0.883212 0.884672 0.881752 0.883470 0.001080
8 4.307447 0.120407 0.093384 0.006497 1 l1 55 {'logistic__C': 1, 'logistic__penalty': 'l1', ... 0.874636 0.860058 0.856725 0.885965 0.856725 0.866821 0.011637 24 0.876552 0.878013 0.872993 0.884672 0.875912 0.877628 0.003882
9 6.050749 0.401549 0.104981 0.017580 1 l1 75 {'logistic__C': 1, 'logistic__penalty': 'l1', ... 0.880466 0.865889 0.871345 0.883041 0.874269 0.875002 0.006188 7 0.884587 0.884587 0.883212 0.882482 0.881022 0.883178 0.001350
10 4.939336 0.329631 0.083388 0.005351 1 l2 55 {'logistic__C': 1, 'logistic__penalty': 'l2', ... 0.874636 0.854227 0.862573 0.883041 0.859649 0.866825 0.010509 22 0.876552 0.879474 0.872993 0.878832 0.877372 0.877045 0.002274
11 7.657663 0.239069 0.109982 0.010915 1 l2 75 {'logistic__C': 1, 'logistic__penalty': 'l2', ... 0.877551 0.868805 0.871345 0.880117 0.874269 0.874417 0.004081 12 0.883126 0.885318 0.881022 0.883942 0.879562 0.882594 0.002058
12 4.456222 0.187478 0.094986 0.011677 2 l1 55 {'logistic__C': 2, 'logistic__penalty': 'l1', ... 0.883382 0.860058 0.862573 0.885965 0.856725 0.869741 0.012360 20 0.878744 0.875822 0.871533 0.881752 0.878102 0.877190 0.003404
13 6.347693 0.785021 0.107981 0.001550 2 l1 75 {'logistic__C': 2, 'logistic__penalty': 'l1', ... 0.874636 0.862974 0.871345 0.883041 0.874269 0.873252 0.006451 16 0.885318 0.885318 0.881022 0.883942 0.881752 0.883470 0.001789
14 4.819359 0.222359 0.094387 0.008729 2 l2 55 {'logistic__C': 2, 'logistic__penalty': 'l2', ... 0.871720 0.848397 0.868421 0.888889 0.850877 0.865660 0.014835 29 0.876552 0.878013 0.875912 0.881022 0.874453 0.877190 0.002231
15 7.553880 0.343301 0.116182 0.010377 2 l2 75 {'logistic__C': 2, 'logistic__penalty': 'l2', ... 0.880466 0.862974 0.871345 0.883041 0.880117 0.875588 0.007443 2 0.884587 0.883126 0.883212 0.883942 0.881752 0.883324 0.000949
16 4.638988 0.518731 0.088386 0.005425 3 l1 55 {'logistic__C': 3, 'logistic__penalty': 'l1', ... 0.880466 0.860058 0.856725 0.883041 0.853801 0.866819 0.012381 25 0.878013 0.878013 0.871533 0.879562 0.878102 0.877045 0.002818
17 7.695258 0.867391 0.109987 0.007069 3 l1 75 {'logistic__C': 3, 'logistic__penalty': 'l1', ... 0.877551 0.865889 0.871345 0.885965 0.877193 0.875588 0.006726 2 0.884587 0.883857 0.882482 0.881752 0.881022 0.882740 0.001316
18 4.820359 0.135156 0.095184 0.009000 3 l2 55 {'logistic__C': 3, 'logistic__penalty': 'l2', ... 0.868805 0.851312 0.865497 0.885965 0.856725 0.865660 0.011898 30 0.875091 0.878013 0.875182 0.882482 0.876642 0.877482 0.002720
19 8.074193 0.300637 0.119179 0.013105 3 l2 75 {'logistic__C': 3, 'logistic__penalty': 'l2', ... 0.877551 0.868805 0.868421 0.883041 0.877193 0.875001 0.005615 8 0.882396 0.880935 0.882482 0.883942 0.881022 0.882155 0.001107
20 4.455024 0.072002 0.098383 0.005084 4 l1 55 {'logistic__C': 4, 'logistic__penalty': 'l1', ... 0.880466 0.854227 0.865497 0.885965 0.856725 0.868576 0.012645 21 0.878013 0.878744 0.872993 0.877372 0.875182 0.876461 0.002103
21 9.934267 3.610927 0.098785 0.007136 4 l1 75 {'logistic__C': 4, 'logistic__penalty': 'l1', ... 0.874636 0.865889 0.871345 0.883041 0.874269 0.873835 0.005566 14 0.884587 0.885318 0.881752 0.883212 0.880292 0.883032 0.001834
22 4.916343 0.140300 0.104582 0.006886 4 l2 55 {'logistic__C': 4, 'logistic__penalty': 'l2', ... 0.874636 0.854227 0.856725 0.888889 0.853801 0.865655 0.013937 32 0.877283 0.878013 0.877372 0.880292 0.875182 0.877629 0.001638
23 7.959014 0.227159 0.101382 0.007392 4 l2 75 {'logistic__C': 4, 'logistic__penalty': 'l2', ... 0.877551 0.874636 0.871345 0.883041 0.871345 0.875583 0.004389 4 0.884587 0.883126 0.881752 0.883212 0.881022 0.882740 0.001242
24 4.461021 0.124955 0.096584 0.006712 5 l1 55 {'logistic__C': 5, 'logistic__penalty': 'l1', ... 0.883382 0.857143 0.865497 0.885965 0.859649 0.870327 0.012051 19 0.877283 0.876552 0.875912 0.880292 0.875912 0.877190 0.001631
25 8.625493 0.933486 0.101180 0.007303 5 l1 75 {'logistic__C': 5, 'logistic__penalty': 'l1', ... 0.874636 0.865889 0.868421 0.883041 0.874269 0.873250 0.005938 17 0.883126 0.885318 0.882482 0.883942 0.882482 0.883470 0.001068
26 5.109909 0.257051 0.098183 0.003969 5 l2 55 {'logistic__C': 5, 'logistic__penalty': 'l2', ... 0.874636 0.848397 0.865497 0.885965 0.856725 0.866243 0.013183 26 0.878013 0.877283 0.871533 0.879562 0.876642 0.876607 0.002717
27 7.753648 0.164861 0.105786 0.007883 5 l2 75 {'logistic__C': 5, 'logistic__penalty': 'l2', ... 0.880466 0.862974 0.871345 0.883041 0.874269 0.874418 0.007091 11 0.883126 0.884587 0.882482 0.883212 0.882482 0.883178 0.000769
28 4.530409 0.116507 0.105584 0.013919 10 l1 55 {'logistic__C': 10, 'logistic__penalty': 'l1',... 0.868805 0.854227 0.859649 0.888889 0.856725 0.865658 0.012618 31 0.874361 0.876552 0.872263 0.881022 0.877372 0.876314 0.002952
29 14.822818 3.597077 0.105584 0.003261 10 l1 75 {'logistic__C': 10, 'logistic__penalty': 'l1',... 0.877551 0.865889 0.871345 0.885965 0.874269 0.875003 0.006688 6 0.884587 0.883857 0.881752 0.885401 0.881022 0.883324 0.001671
30 4.913942 0.293457 0.102382 0.016926 10 l2 55 {'logistic__C': 10, 'logistic__penalty': 'l2',... 0.874636 0.857143 0.859649 0.885965 0.856725 0.866823 0.011615 23 0.877283 0.876552 0.871533 0.881022 0.875182 0.876314 0.003076
31 8.067990 1.058037 0.085586 0.026744 10 l2 75 {'logistic__C': 10, 'logistic__penalty': 'l2',... 0.877551 0.862974 0.871345 0.885965 0.874269 0.874420 0.007531 10 0.884587 0.882396 0.881752 0.882482 0.881022 0.882448 0.001192
In [219]:
# Final hyperparamters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)
Best AUC:  0.876168067967132
Best hyperparameters:  {'logistic__C': 0.1, 'logistic__penalty': 'l2', 'pca__n_components': 75}
In [221]:
# predict churn on test data
y_pred = model.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, y_pred_prob),2))
[[6279 1993]
 [  97  632]]
Sensitivity :  0.7590667311411993
Specificity :  0.8669410150891632
AUC: 0.89

PCA and Random Forest

Let us build predictive model using Random Forest and PCA. Although, we don't need to worry about dimensionality reduction in Random Forest but adding PCA might help in faster convergence.

In [222]:
# PCA
pca = PCA()

# Class imbalance fix
forest = RandomForestClassifier(class_weight={0:0.06, 1: 0.94})

# create pipeline
steps = [("scaler", StandardScaler()), 
         ("pca", pca),
         ("random_forest", forest)
        ]

# compile pipeline
pca_random = Pipeline(steps)

# hyperparameter space
params = {'pca__n_components': [55, 75], "random_forest__criterion": ['gini', 'entropy'], "random_forest__max_features": ['auto', 'log2', 0.4], "random_forest__min_samples_leaf" : [20, 30, 40, 50, 100]}

# create 5 folds
folds = 5

# create gridsearch object
model = GridSearchCV(estimator=pca_random, cv=folds, param_grid=params, scoring='roc_auc', return_train_score=True, n_jobs=-1, verbose=1)
In [223]:
# fit model
model.fit(X_train, y_train)
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 15.6min finished
Out[223]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('pca',
                                        PCA(copy=True, iterated_power='auto',
                                            n_components=None,
                                            random_state=None,
                                            svd_solver='auto', tol=0.0,
                                            whiten=False)),
                                       ('random_forest',
                                        RandomForestClassifier(bootstrap=True,
                                                               class_weight={0: 0.06,...
                                                               random_state=None,
                                                               verbose=0,
                                                               warm_start=False))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'pca__n_components': [55, 75],
                         'random_forest__criterion': ['gini', 'entropy'],
                         'random_forest__max_features': ['auto', 'log2', 0.4],
                         'random_forest__min_samples_leaf': [20, 30, 40, 50,
                                                             100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=1)
In [224]:
# Final hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)
Best AUC:  0.8907179643429171
Best hyperparameters:  {'pca__n_components': 55, 'random_forest__criterion': 'entropy', 'random_forest__max_features': 0.4, 'random_forest__min_samples_leaf': 50}
In [225]:
# predict churn on test data
y_pred = model.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC: ", round(roc_auc_score(y_test, y_pred_prob),2))
[[7230 1042]
 [ 192  537]]
Sensitivity :  0.8740328820116054
Specificity :  0.7366255144032922
AUC:  0.88

Feature Importance

In [229]:
# predictors
features = telecom_filtered.drop('churn', axis=1).columns

# feature_importance
importance = rf_model.feature_importances_

# create dataframe
feature_importance = pd.DataFrame({'variables': features, 'importance_percentage': importance*100})
feature_importance = feature_importance[['variables', 'importance_percentage']]

# sort features
feature_importance = feature_importance.sort_values('importance_percentage', ascending=False).reset_index(drop=True)
print("Sum of importance=", feature_importance.importance_percentage.sum())
feature_importance
Sum of importance= 100.0
Out[229]:
variables importance_percentage
0 total_ic_mou_8 6.417407
1 loc_ic_mou_8 5.157730
2 loc_ic_t2m_mou_8 4.481709
3 total_og_mou_8 3.170403
4 arpu_8 2.645145
5 total_rech_amt_diff 2.643487
6 loc_og_mou_8 2.535097
7 last_day_rch_amt_8 2.363581
8 loc_og_t2m_mou_8 2.109590
9 loc_ic_t2t_mou_8 2.053483
10 max_rech_amt_8 1.982097
11 loc_og_t2t_mou_8 1.970485
12 arpu_diff 1.873198
13 roam_ic_mou_8 1.852435
14 roam_og_mou_8 1.676053
15 av_rech_amt_data_8 1.637372
16 total_rech_num_diff 1.568065
17 total_rech_amt_8 1.411694
18 std_og_mou_diff 1.370337
19 roam_og_mou_diff 1.351539
20 std_ic_mou_8 1.339294
21 total_ic_mou_diff 1.290218
22 roam_ic_mou_diff 1.152385
23 offnet_mou_diff 1.096119
24 loc_ic_mou_diff 1.060559
25 total_og_mou_diff 0.947868
26 max_rech_data_8 0.887171
27 loc_og_t2f_mou_8 0.844528
28 loc_ic_t2f_mou_8 0.791121
29 offnet_mou_8 0.745068
30 max_rech_amt_diff 0.743211
31 onnet_mou_diff 0.678629
32 vol_2g_mb_8 0.642206
33 arpu_7 0.638962
34 loc_og_mou_diff 0.620370
35 loc_ic_mou_7 0.598560
36 std_ic_t2m_mou_8 0.592085
37 vol_2g_mb_diff 0.586331
38 total_ic_mou_7 0.572606
39 std_ic_mou_diff 0.564474
40 total_rech_num_8 0.561037
41 aon 0.549956
42 av_rech_amt_data_diff 0.548167
43 onnet_mou_8 0.546466
44 vol_3g_mb_8 0.515741
45 std_og_mou_7 0.500509
46 std_og_mou_8 0.498686
47 total_rech_data_8 0.498242
48 loc_ic_mou_6 0.496588
49 total_rech_amt_6 0.496099
50 loc_ic_t2m_mou_6 0.494406
51 arpu_6 0.477663
52 total_ic_mou_6 0.473295
53 total_og_mou_7 0.472316
54 night_pck_user_8 0.472256
55 total_rech_amt_7 0.468936
56 loc_ic_t2t_mou_7 0.461880
57 loc_og_t2m_mou_6 0.451617
58 total_rech_num_7 0.444959
59 max_rech_data_diff 0.440663
60 fb_user_8 0.424956
61 loc_ic_t2t_mou_6 0.424547
62 loc_og_t2m_mou_7 0.414722
63 loc_og_mou_6 0.414667
64 loc_ic_t2m_mou_7 0.411894
65 loc_og_mou_7 0.411552
66 offnet_mou_6 0.407059
67 last_day_rch_amt_7 0.391939
68 max_rech_amt_6 0.387450
69 std_og_t2m_mou_8 0.385282
70 total_rech_num_6 0.380143
71 offnet_mou_7 0.377669
72 spl_og_mou_diff 0.374110
73 std_ic_mou_7 0.368057
74 onnet_mou_6 0.367927
75 roam_og_mou_7 0.366732
76 loc_og_t2t_mou_7 0.362130
77 std_og_mou_6 0.349199
78 roam_ic_mou_7 0.348569
79 onnet_mou_7 0.348084
80 total_og_mou_6 0.346186
81 vol_3g_mb_diff 0.341262
82 loc_og_t2t_mou_6 0.330804
83 spl_ic_mou_diff 0.330456
84 loc_ic_t2f_mou_7 0.329535
85 max_rech_amt_7 0.329359
86 std_ic_mou_6 0.322964
87 std_og_t2m_mou_7 0.319578
88 last_day_rch_amt_6 0.312396
89 loc_ic_t2f_mou_6 0.310905
90 std_ic_t2m_mou_6 0.309676
91 aug_vbc_3g 0.307845
92 std_og_t2t_mou_7 0.307567
93 std_ic_t2m_mou_7 0.306765
94 total_rech_data_diff 0.298487
95 std_ic_t2f_mou_8 0.290413
96 spl_og_mou_7 0.288775
97 isd_ic_mou_diff 0.288285
98 spl_og_mou_8 0.284547
99 av_rech_amt_data_7 0.283542
100 vol_2g_mb_6 0.283265
101 std_og_t2m_mou_6 0.278569
102 loc_og_t2f_mou_7 0.276642
103 std_og_t2t_mou_6 0.272343
104 loc_og_t2f_mou_6 0.265883
105 vol_3g_mb_7 0.261102
106 std_og_t2t_mou_8 0.259132
107 spl_og_mou_6 0.254940
108 av_rech_amt_data_6 0.251308
109 vol_2g_mb_7 0.249721
110 std_ic_t2t_mou_8 0.242389
111 std_ic_t2t_mou_7 0.232519
112 std_ic_t2t_mou_6 0.231937
113 jun_vbc_3g 0.224690
114 ic_others_8 0.221020
115 max_rech_data_6 0.219881
116 vol_3g_mb_6 0.219801
117 max_rech_data_7 0.212149
118 jul_vbc_3g 0.211226
119 sachet_2g_8 0.190806
120 isd_og_mou_diff 0.186321
121 isd_ic_mou_6 0.184548
122 sep_vbc_3g 0.182737
123 ic_others_7 0.177509
124 total_rech_data_7 0.176537
125 roam_og_mou_6 0.171920
126 spl_ic_mou_6 0.169185
127 roam_ic_mou_6 0.165096
128 og_others_6 0.157593
129 spl_ic_mou_8 0.156859
130 loc_og_t2c_mou_7 0.155250
131 isd_ic_mou_7 0.154417
132 loc_og_t2c_mou_6 0.153892
133 ic_others_6 0.153150
134 loc_og_t2c_mou_8 0.142169
135 isd_ic_mou_8 0.138297
136 isd_og_mou_6 0.136114
137 std_ic_t2f_mou_7 0.135374
138 total_rech_data_6 0.133929
139 sachet_2g_7 0.128281
140 std_og_t2f_mou_7 0.127159
141 isd_og_mou_7 0.127066
142 std_ic_t2f_mou_6 0.124170
143 sachet_2g_6 0.102225
144 isd_og_mou_8 0.101681
145 std_og_t2f_mou_6 0.100328
146 sachet_3g_7 0.071989
147 monthly_2g_6 0.067700
148 spl_ic_mou_7 0.063756
149 monthly_3g_7 0.062782
150 monthly_3g_6 0.060441
151 std_og_t2f_mou_8 0.055535
152 sachet_3g_8 0.053851
153 sachet_3g_6 0.051878
154 fb_user_7 0.050786
155 night_pck_user_7 0.045649
156 monthly_2g_7 0.045581
157 monthly_3g_8 0.043118
158 fb_user_6 0.042615
159 night_pck_user_6 0.037823
160 monthly_2g_8 0.035383
161 og_others_7 0.013123
162 og_others_8 0.008676
163 loc_ic_t2o_mou 0.000000
164 std_ic_t2o_mou_7 0.000000
165 std_og_t2c_mou_6 0.000000
166 std_og_t2c_mou_7 0.000000
167 std_og_t2c_mou_8 0.000000
168 std_ic_t2o_mou_8 0.000000
169 std_og_t2o_mou 0.000000
170 std_ic_t2o_mou_6 0.000000
171 loc_og_t2o_mou 0.000000

Logistic Regression

In [242]:
learner_pca = LogisticRegression(class_weight='balanced')
learner_pca.fit(X_train,y_train)
Out[242]:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [243]:
#Predict on training set
dtrain_predictions = learner_pca.predict(X_train)
dtrain_predprob = learner_pca.predict_proba(X_train)[:,1]
In [244]:
print ("Accuracy :",metrics.roc_auc_score(y_train, dtrain_predictions))
print ("Recall/Sensitivity :",metrics.recall_score(y_train, dtrain_predictions))
print ("AUC Score (Train):",metrics.roc_auc_score(y_train, dtrain_predprob))
Accuracy : 0.8415961015904517
Recall/Sensitivity : 0.8539719626168224
AUC Score (Train): 0.9143073740585401

PCA and Decision Tree

Let us build predictive model using Decision Tree and PCA.

In [262]:
# PCA
pca = PCA()

# Class imbalance fix
dc = DecisionTreeClassifier(class_weight={0:0.06, 1: 0.94})

# create pipeline
steps = [("scaler", StandardScaler()), 
         ("pca", pca),
         ("dc", dc)
        ]

# compile pipeline
pca_dc = Pipeline(steps)

# hyperparameter space
params = {'pca__n_components': [55, 75], "dc__criterion": ['gini', 'entropy'], "dc__max_features": ['auto', 'log2', 0.4], "dc__max_depth": [2, 3, 4], "dc__min_samples_leaf" : [20, 30, 40, 50, 100]}

# create 5 folds
folds = 5

# create gridsearch object
model = GridSearchCV(estimator=pca_dc, cv=folds, param_grid=params, scoring='roc_auc', return_train_score=True, n_jobs=-1, verbose=1)
In [263]:
# fit model
model.fit(X_train, y_train)
Fitting 5 folds for each of 180 candidates, totalling 900 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   56.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  9.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 900 out of 900 | elapsed: 18.3min finished
Out[263]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('pca',
                                        PCA(copy=True, iterated_power='auto',
                                            n_components=None,
                                            random_state=None,
                                            svd_solver='auto', tol=0.0,
                                            whiten=False)),
                                       ('dc',
                                        DecisionTreeClassifier(class_weight={0: 0.06,
                                                                             1: 0.94},
                                                               criterion='gini',
                                                               m...
                                                               presort=False,
                                                               random_state=None,
                                                               splitter='best'))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'dc__criterion': ['gini', 'entropy'],
                         'dc__max_depth': [2, 3, 4],
                         'dc__max_features': ['auto', 'log2', 0.4],
                         'dc__min_samples_leaf': [20, 30, 40, 50, 100],
                         'pca__n_components': [55, 75]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=1)
In [264]:
# Final hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)
Best AUC:  0.8413191545810401
Best hyperparameters:  {'dc__criterion': 'entropy', 'dc__max_depth': 4, 'dc__max_features': 0.4, 'dc__min_samples_leaf': 100, 'pca__n_components': 55}

Although, we got the most optimized at max_depth at 4, but it can get complex and might overfit. Let's try with 3.

In [266]:
# predict churn on test data
dc = DecisionTreeClassifier(class_weight={0:0.06, 1: 0.94}, criterion='entropy', max_depth=3, min_samples_leaf=50)
dc.fit(X_train, y_train)
y_pred = dc.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = dc.predict_proba(X_test)[:, 1]
print("AUC: ", round(roc_auc_score(y_test, y_pred_prob),2))
[[6552 1720]
 [ 144  585]]
Sensitivity :  0.7920696324951644
Specificity :  0.8024691358024691
AUC:  0.88

Summary

PCA with Logistic Regression
Sensitivity : 0.7590667311411993
Specificity : 0.8669410150891632
AUC: 0.89

PCA with Random Forest
Sensitivity : 0.8740328820116054
Specificity : 0.7366255144032922
AUC: 0.88

PCA with Decision Classifier
Sensitivity : 0.7920696324951644
Specificity : 0.8024691358024691
AUC: 0.88

Logistic Regression
Accuracy : 0.8415961015904517
Recall/Sensitivity : 0.8539719626168224

Overall, the PCA with Random Forest is the best for the above scenario as it has the best sensitivity and comparable accuracy for the given business objective.

Identify Best Predictors

Let us now design a model to get the most important predictors for churn problem and let us visualize them. We would use decision tree for identifying the most important predictors as it is very easy to identify the most important predictors by visualizing the decision tree.

Decision Tree

In [270]:
# Class imbalance fix
dc = DecisionTreeClassifier(class_weight={0:0.06, 1: 0.94}, max_depth=3, max_features='auto', min_samples_leaf=50)

# create pipeline
steps = [("scaler", StandardScaler()), 
         ("dc", dc)]

# compile pipeline
pipeline = Pipeline(steps)
In [271]:
# fit model
pipeline.fit(X_train, y_train)

# check score on train data
pipeline.score(X_train, y_train)
Out[271]:
0.8815714285714286
In [272]:
# predict churn on test data
y_pred = pipeline.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, y_pred_prob),2))
[[7416  856]
 [ 213  516]]
Sensitivity :  0.8965183752417795
Specificity :  0.7078189300411523
AUC: 0.85

Hyper-parameter Tuning for Decision Tree

In [273]:
# Class imbalance fix
dc = DecisionTreeClassifier(class_weight={0:0.06, 1: 0.94})

# create pipeline
steps = [("scaler", StandardScaler()),
         ("dc", dc)
        ]

# compile pipeline
pipeline_dc = Pipeline(steps)

# hyperparameter space
params = {"dc__criterion": ['gini', 'entropy'], "dc__max_features": ['auto', 'log2', 0.4], "dc__max_depth": [2, 3, 4], "dc__min_samples_leaf" : [20, 30, 40, 50, 100]}

# create 5 folds
folds = 5

# create gridsearch object
model = GridSearchCV(estimator=pipeline_dc, cv=folds, param_grid=params, scoring='roc_auc', return_train_score=True, n_jobs=-1, verbose=1)
In [274]:
# fit model
model.fit(X_train, y_train)
Fitting 5 folds for each of 90 candidates, totalling 450 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:  2.8min finished
Out[274]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('dc',
                                        DecisionTreeClassifier(class_weight={0: 0.06,
                                                                             1: 0.94},
                                                               criterion='gini',
                                                               max_depth=None,
                                                               max_features=None,
                                                               max_leaf_nodes=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                                               min_samples_leaf=1...
                                                               min_weight_fraction_leaf=0.0,
                                                               presort=False,
                                                               random_state=None,
                                                               splitter='best'))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'dc__criterion': ['gini', 'entropy'],
                         'dc__max_depth': [2, 3, 4],
                         'dc__max_features': ['auto', 'log2', 0.4],
                         'dc__min_samples_leaf': [20, 30, 40, 50, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=1)
In [275]:
# Final hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)
Best AUC:  0.8941532916129352
Best hyperparameters:  {'dc__criterion': 'gini', 'dc__max_depth': 4, 'dc__max_features': 0.4, 'dc__min_samples_leaf': 50}
In [276]:
# predict churn on test data
dc = DecisionTreeClassifier(class_weight={0:0.06, 1: 0.94}, criterion='entropy', max_depth=4, min_samples_leaf=30, max_features=0.4)
dc.fit(X_train, y_train)
y_pred = dc.predict(X_test)

printScores(y_test, y_pred)

# check area under curve
y_pred_prob = dc.predict_proba(X_test)[:, 1]
print("AUC: ", round(roc_auc_score(y_test, y_pred_prob),2))
[[6707 1565]
 [ 132  597]]
Sensitivity :  0.8108075435203095
Specificity :  0.8189300411522634
AUC:  0.89

Let's visualize the final tree

For the ease and to prevent any errors, the following code is commented, but if the graphviz is installed and the path is specified, then the following code can be run. For ease just attaching the output of this code as the form of image.

In [277]:
# # Importing required packages for visualization
# from IPython.display import Image  
# from sklearn.externals.six import StringIO  
# from sklearn.tree import export_graphviz
# import pydotplus, graphviz

# # Putting features
# features = list(telecom_filtered.columns)
# features.remove('churn')
In [278]:
# # If you're on windows:
# # Specifing path for dot file.
# import os
# os.environ["PATH"] += os.pathsep + 'C:\Program Files (x86)\Graphviz2.38\bin'
In [279]:
# # plotting tree with max_depth=4
# dot_data = StringIO()  
# export_graphviz(dc, out_file=dot_data,
#                 feature_names=features, filled=True,rounded=True)

# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# graph.write_pdf("dt_churn.pdf")

image.png

Conclusions

In decision tree, the overall entropy at each node is decreased, based on which feature importance is calculated as we have used entropy in calculating DT, so starting from the root shows the decresing importance of each feature.

  1. loc_ic_mou_8, roam_og_mou_8 & vol_3g_mb_8 are some of the most important features.
  2. Local Incoming and roaming Outgoing Calls for 8th Month and volume of data used in 8th Month are the most important columns to predict churn.
  3. Total Recharge Amount is a strong feature to predict churn.
  4. Random Forest produced the best prediction results followed by Decision Tree.

The company should focus on the above parameters in the action phase can help company to provide offers to such customers to retain the probable churn customers. They can provide offers related to the data usage and roaming outgoing call plans so that they improve the usage and the customers do not leave the company.

The company needs to focus on the STD and ISD rates. Perhaps, the rates are too high. Provide them with some kind of STD and ISD packages.

To look into both of the issues stated above, it is desired that the telecom company collects customer query and complaint data and work on their services according to the needs of customers.